1. Introduction


For many applications, especially those in which a computer is
controlling a real-time process (e.g. telephone switching, flight
control of an aircraft or spacecraft, control of traffic in a
transportation system, etc.), it is desirable to continuously monitor
the performance of the system, as it is being used, to determine
whether its actual behavior is tolerably close to the intended
behavior. It is this sort of monitoring which we mean by the term
"on-line diagnosis." Implementation of on-line diagnosis may be
external to the system, both internal and external, or completely
internal. In the last extreme, on-line diagnosis is sometimes
referred to as "self-diagnosis" or "self-checking" ([1], [2]).

On-line diagnosis plays a very important role in almost every
ultra-reliable computer system which has ever been proposed (see [2],
[3], or [4] for example), and a lesser but still important role in many
conventional systems. For example, the IBM System/360 utilizes
checking circuits to detect errors [5]. The signals generated by these
circuits are used in some models to freeze the computer so that the
instruction which was currently executing may be retried if possible,
and to assist in the checkout and repair of the computer if the
automatic retry attempt fails. Ultra-reliable computers typically use
the signals generated by the monitoring device to provide the computer
system with the information it needs to automatically reconfigure
itself so as to avoid using any faulty circuits. One other use for such
signals is to simply inform the system user that the system is not
operating properly and that there may be errors in his data.

In general, on-line diagnosis is used to verify that the system 
is operating properly; or conversely, to signal that it is in need of 
repair. In most computer systems this task is also performed in some 
part by off-line diagnosis. By off-line diagnosis we are referring to 
the process of removing the system from its normal operation and 
applying a series of prearranged tests to determine whether any 
faults are present in the system. There are major differences between 
on-line and off-line diagnosis and it is important to be aware of the 
capabilities and the limitations of each. 

One basic difference is that on-line diagnosis is a continuous
process whereas off-line diagnosis is periodic. Consequently, only
permanent faults can be diagnosed with off-line diagnosis, because a
transient fault may not be present in the system when it is tested.
On the other hand, since on-line diagnosis is a continuous monitoring
process, both permanent and transient faults can be diagnosed. Also,
with off-line diagnosis the system must be removed from its normal
operation to apply the tests, and this may not be acceptable in a
real-time application.

The cost of either form of diagnosis depends on the nature of 
the system to be diagnosed, the technology to be used in building the 
system, and the degree of protection against faulty operation that is 
required. With on-line diagnosis the cost is almost totally in the
design and construction of extra hardware. With off-line diagnosis the
cost is in the initial generation of the tests and in the subsequent
storage and running of these tests.

In general, off-line diagnosis is useful for factory testing and
for applications where immediate knowledge of any faulty behavior
is not essential. Off-line diagnosis is also useful for locating the
source of trouble once such trouble is indicated by on-line diagnosis.
For example, Bell System's No. 1 ESS [4] uses duplicate processors
to continually check one another and, once a discrepancy is detected,
off-line diagnosis is used to determine which processor exhibited
the erroneous behavior and to locate the faulty module in that
processor.

In the MARCS study [2] a more integrated use of on-line diagnosis
is proposed whereby a number of checking circuits observe the
performance of various parts of the computer. With a scheme such as
this, information about the location of a fault can be obtained from
knowledge of which checking circuit indicated the trouble.

Both forms of diagnosis have been used to check the operation
of computers from the very first machines until the present time.
In a short paper published in 1957, Eckert [6] informs us that off-line
diagnosis was relied upon for the ENIAC computer, that the BINAC
system had duplicate processors, and that the UNIVAC used a more
economical on-line diagnosis scheme involving 35 checking circuits.
During the past decade, however, the development of theory and
techniques for fault diagnosis in digital systems and circuits has
focused mainly on problems of off-line diagnosis (see [1] and [7] for
example).

The work that has been done on on-line diagnosis is mainly in
the area of techniques. One early paper is Kautz's study [8] of fault
detection techniques for combinational circuits. In this paper he
investigated a number of techniques including the use of codes and
the possibility of greater economy if immediate detection of errors
was not necessary. Many of the more common on-line diagnosis
techniques have been gathered together and published in a book by
Sellers, Hsiao, and Bearnson [9]. Much of what is in this book and a
large portion of the techniques that can be found elsewhere in the
literature are concerned with special circuits such as adders and
counters. For example, see the papers by Avizienis [10], Rao [11],
and Dorr [12].

Relatively little work can be found on the theory of on-line
diagnosis. In one of the earliest works of a theoretical nature,
Peterson [13] showed that an adder can be checked using a
completely independent circuit which adds the residue, modulo some
base, of the operands. He went on to show that any independent check
of this type was a residue class check. Another interesting theoretical
result was published by Peterson and Rabin [14]. They showed that
combinational circuits can differ greatly in their inherent diagnosability
and that in some cases virtual duplication is necessary. A later and
more general paper is that of Carter and Schneider [15]. They propose
a model for on-line diagnosis which involves a system and an external
checker. To be on-line diagnosable the system must produce
non-code outputs when it fails and the external checker must signal
the occurrence of such an output. The checking circuits that they
consider indicate the presence of faults in the checkers themselves in
addition to faults in the systems they are monitoring.

With the decreasing cost of logic and the increasing use of
computers in real-time applications where erroneous operation can
result in the loss of human life and/or large sums of money, the use
of on-line diagnosis can be expected to increase greatly in the
near future. The importance of this area, along with the relative
lack of theoretical research, is our motivation for initiating this study
of on-line diagnosis.





2. Discrete-Time Systems 


On-line diagnosis is inherently a more complex process
than off-line diagnosis because of two complicating factors: i) it has
to deal with input over which it has no control and ii) faults can occur
as the system is being diagnosed. We would like to build a theory of
on-line diagnosis using conventional models of time-invariant (stationary,
fixed) systems (e.g. sequential machines, sequential networks, etc.).
However, due to the second factor mentioned above, these conventional
models can no longer be used to represent the dynamics of the system
as it is being diagnosed. A system which is designed and built to behave
in a time-invariant manner becomes a time-varying system as faults
occur while it is in use. Therefore, a more general representation
based on time-varying systems is required. Based on this fundamental
observation we have developed what we believe to be an appropriate
model for the study of on-line diagnosis.

Definition 1 

Relative to the time-base $T = \{\ldots, -1, 0, 1, \ldots\}$, a
discrete-time system (with finite input and output alphabets) is a
system

    $S = (I, Q, Z, \delta, \lambda)$

where $I$ is a finite set, the input alphabet
      $Q$ is a set, the state set
      $Z$ is a finite set, the output alphabet
      $\delta: Q \times I \times T \to Q$, the transition function
      $\lambda: Q \times I \times T \to Z$, the output function.

The interpretation of a discrete-time system is a system which,
if at time $t$ it is in state $q$ and receives input $a$, will at time $t$
emit output symbol $\lambda(q, a, t)$ and at time $t + 1$ be in state
$\delta(q, a, t)$. In the special case where the functions $\delta$ and
$\lambda$ are independent of time (i.e., are time-invariant), the
definition reduces to that of a (Mealy) sequential machine. In the
discussion that follows we will assume, unless otherwise qualified,
that $S$ is finite-state (i.e., $|Q| < \infty$).
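
As an informal illustration of Definition 1, the following Python sketch models a discrete-time system whose transition and output functions take the time as an explicit argument. The class name `DiscreteTimeSystem`, the method `step`, and the two-state `toggle` example are assumptions made for this sketch and do not come from the study.

```python
from dataclasses import dataclass
from typing import Callable, Hashable


@dataclass(frozen=True)
class DiscreteTimeSystem:
    """S = (I, Q, Z, delta, lam) with time-dependent delta and lam."""
    inputs: frozenset
    states: frozenset
    outputs: frozenset
    delta: Callable[[Hashable, Hashable, int], Hashable]  # transition function
    lam: Callable[[Hashable, Hashable, int], Hashable]    # output function

    def step(self, q, a, t):
        """Return (output emitted at time t, state at time t + 1)."""
        return self.lam(q, a, t), self.delta(q, a, t)


# Hypothetical two-state toggle whose output function changes at time 3,
# so the system is time-varying rather than a sequential machine.
toggle = DiscreteTimeSystem(
    inputs=frozenset({1}),
    states=frozenset({0, 1}),
    outputs=frozenset({0, 1}),
    delta=lambda q, a, t: 1 - q,
    lam=lambda q, a, t: q if t < 3 else 0,
)
```

If `delta` and `lam` ignore their third argument, the object behaves as a (Mealy) sequential machine, which mirrors the special case noted above.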

To describe the behavior of a system, we first extend the
transition and output functions to input sequences in the following
natural way. If $I^*$ is the set of all finite-length sequences over $I$
(including the null sequence $\Lambda$) then:

    $\bar{\delta}: Q \times I^* \times T \to Q$

where, for all $q \in Q$, $a \in I$, $t \in T$:

    $\bar{\delta}(q, \Lambda, t) = q$
    $\bar{\delta}(q, a, t) = \delta(q, a, t)$
    $\bar{\delta}(q, a_1 a_2 \ldots a_n, t) = \delta(\bar{\delta}(q, a_1 a_2 \ldots a_{n-1}, t), a_n, t + n - 1)$.

Similarly, if $I^+ = I^* - \{\Lambda\}$:

    $\bar{\lambda}: Q \times I^+ \times T \to Z$

where, for all $q \in Q$, $a \in I$, $t \in T$:

    $\bar{\lambda}(q, a, t) = \lambda(q, a, t)$
    $\bar{\lambda}(q, a_1 a_2 \ldots a_n, t) = \lambda(\bar{\delta}(q, a_1 a_2 \ldots a_{n-1}, t), a_n, t + n - 1)$.

Relative to these extended functions, the behavior of $S$ in state
$q$ is the function

    $\beta_q: I^+ \times T \to Z$

where $\beta_q(x, t) = \bar{\lambda}(q, x, t)$.

Thus, if the state of the system is $q$ and it receives input sequence $x$
starting at time $t$, then $\beta_q(x, t)$ is the output emitted when the last
symbol in $x$ is received (i.e. the output at time $t + |x| - 1$, where
$|x|$ is the length of $x$).
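
The extension to input sequences and the behavior function can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `delta_ext` and `behavior`, and the small counter used to exercise them, are hypothetical and not part of the study.

```python
def delta_ext(delta, q, x, t):
    """State reached from q after input sequence x is applied starting at time t."""
    for k, a in enumerate(x):
        q = delta(q, a, t + k)  # the k-th symbol of x arrives at time t + k
    return q


def behavior(delta, lam, q, x, t):
    """beta_q(x, t): output emitted when the last symbol of nonempty x is received."""
    if not x:
        raise ValueError("behavior is defined only for nonempty input sequences")
    q_last = delta_ext(delta, q, x[:-1], t)
    return lam(q_last, x[-1], t + len(x) - 1)


# Hypothetical system: a modulo-3 counter with a time-dependent output.
delta = lambda q, a, t: (q + a) % 3
lam = lambda q, a, t: (q + a + t) % 2
```

For the empty sequence, `delta_ext` returns the starting state unchanged, matching the first clause of the extended transition function.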

Many investigations of on-line diagnosis and fault tolerance have 
studied redundancy schemes such as duplication and triplication. 
Typically they have not dealt with the problem of starting each copy 
of a machine in the same state. In this study we will be examining 
these schemes and others for which the same problem arises. Since 
many existing systems have reset capabilities, and since this feature
solves the above synchronizing problem we will use a special type of 
system for which the reset capabilities are explicitly specified. This 
explicit specification of the reset capability is essential since it is an 
important part of the total system and is just as subject to faults as 
any other portion of the system. 

Definition 2 

A resettable discrete-time system (resettable system) is a
system

    $S = (I, Q, Z, \delta, \lambda, R, \rho)$

where $(I, Q, Z, \delta, \lambda)$ is a discrete-time system
      $R$ is a finite nonempty set, the reset alphabet
      $\rho: R \times T \to Q$, the reset function.

A resettable system is resettable in the sense that if reset $r$ is
applied at time $t - 1$ then $\rho(r, t)$ is the state at time $t$. This
method of specifying reset capability is a matter of convenience. This
feature could just as well have been incorporated as a restriction on
the transition function relative to a distinguished subset of input
symbols called the reset alphabet. Thus a resettable discrete-time
system can indeed be regarded as a special type of discrete-time
system. If $\delta$, $\lambda$, and $\rho$ are all independent of time, the
definition reduces to that of a resettable sequential machine. Thus a
resettable machine can be viewed as a resettable system which is
invariant under time-translations.


Given a resettable system we can view it as a system organized 
as in Figure 1. 



Figure 1 Schematic Diagram for $S = (I, Q, Z, \delta, \lambda, R, \rho)$


We will represent sequential machines in the usual manner, i.e.
via transition tables or state graphs. Resettable machines are
represented by minor extensions of these two methods. The transition
table of a resettable machine is identical to that of a machine with the
addition of one column on the right to accommodate the reset function.
If $\rho(r) = q$ then $r$ will appear in the last column of the $q$ row.
Similarly, the state graph of a resettable machine is identical to that
of a machine with the addition of one short arrow for each $r \in R$.
This arrow will be labeled $r$ and will point to state $\rho(r)$.

Example 1 

Let $M_1$ be the sequence generator with reset alphabet $\{0\}$ and
input alphabet $\{1\}$ which has been implemented by the circuit in
Figure 2.

Figure 2 Circuit for $M_1$

Then the transition table and the state graph for $M_1$ are as shown in
Figures 3 and 4.

    Q  | 1    | R
    ---+------+---
    00 | 01/0 | 0
    01 | 11/1 |
    10 | 00/1 |
    11 | 10/? |

Figure 3 Transition Table for $M_1$

Figure 4 State Graph for $M_1$

The circuit in Figure 2 is also an implementation of a similar machine
$M_2$ with input alphabet $\{0, 1\}$. The state graph for $M_2$ is shown in
Figure 5.

Figure 5 State Graph for $M_2$

Thus, in $M_2$ the input symbol "0" can be interpreted as a regular
input or as a reset input. In $M_2$ the outputs for input 0 are explicitly
specified whereas in $M_1$ they may be regarded as classical "don't
cares."

In general, we have no convenient representations for
discrete-time systems and resettable systems. About all we can do is
specify each of the functions $\delta$, $\lambda$, and $\rho$ explicitly.
However, most of the systems that we will deal with will be truly
time-varying at only a few points in time and thus can be described by
the machines they resemble in the intervals between these points.

Example 2 

Suppose that $M_1$ was implemented as in Figure 2 and that this
circuit operated perfectly up to time 100 when gate 2 became
stuck-at-0. What actually existed was not a resettable machine but a
(time-varying) resettable system $S$ which looks like $M_1$ up to time 100
and like a different machine, say $M_1'$, thereafter. The graph for $M_1'$
is shown in Figure 6.

Figure 6 Resettable Machine $M_1'$

We can represent $S$ as follows:

    $S = M_1$   for $t < 100$
    $S = M_1'$  for $t \geq 100$.

By this we mean that $I = I_1 = I_1'$ and likewise for $Q$, $Z$, and $R$, and
that

    $\delta(q, a, t) = \delta_1(q, a)$   for $t < 100$
    $\delta(q, a, t) = \delta_1'(q, a)$  for $t \geq 100$

and similarly for $\lambda$ and $\rho$.
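
A piecewise description of this kind can be mimicked by a small combinator that glues two time-invariant functions into one time-varying function. The sketch below is illustrative only: the helper name `switch_at` and the two transition functions are hypothetical stand-ins, since the actual tables of the machines in Example 2 are not reproduced here.

```python
def switch_at(tau, before, after):
    """Combine two time-invariant functions f(q, a) into a time-varying
    function f(q, a, t) that uses `before` for t < tau and `after` for t >= tau."""
    def combined(q, a, t):
        return before(q, a) if t < tau else after(q, a)
    return combined


# Hypothetical stand-ins for the fault-free and faulty transition functions.
delta_good = lambda q, a: (q + 1) % 4   # cycles through four states
delta_bad = lambda q, a: 0              # collapses to a single state

delta = switch_at(100, delta_good, delta_bad)
```

The same combinator can be applied to the output and reset functions, so a system that is time-varying at only a few points can be assembled from the machines it resembles between those points.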


For resettable systems we take the definitions of the extended
functions $\bar{\delta}$ and $\bar{\lambda}$ and of the behavior $\beta_q$
to be the same as those for systems. It is also convenient in the case
of resettable systems to specify behavior relative to a reset input $r$
that is released at time $t$; that is, the behavior of $S$ for condition
$(r, t)$ ($r \in R$, $t \in T$) is the function

    $\beta_{r,t}: I^+ \to Z$

where

    $\beta_{r,t}(x) = \beta_{\rho(r,t)}(x, t)$.

If $t = 0$, $\beta_{r,0}$ is referred to as the behavior of $S$ for initial
reset $r$ and is denoted simply as $\beta_r$.
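
The behavior for a condition $(r, t)$ can be sketched in Python by starting from the reset state and feeding the input sequence from time $t$ onward. The function name `behavior_for_condition` and the four-state counter used as an example are assumptions made for this sketch.

```python
def behavior_for_condition(delta, lam, rho, r, t, x):
    """beta_{r,t}(x): output on the last symbol of x when reset r is released at time t."""
    if not x:
        raise ValueError("behavior is defined only for nonempty input sequences")
    q = rho(r, t)  # state at time t after reset r was applied at time t - 1
    for k, a in enumerate(x[:-1]):
        q = delta(q, a, t + k)
    return lam(q, x[-1], t + len(x) - 1)


# Hypothetical time-invariant example: a modulo-4 counter that outputs the
# low-order bit of its state and is reset to state 0.
delta = lambda q, a, t: (q + 1) % 4
lam = lambda q, a, t: q % 2
rho = lambda r, t: 0
```

Calling the function with `t = 0` gives the behavior for initial reset `r`.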
3. Realizations

Discrete-time systems are a straightforward generalization of 
sequential machines and many notions that we are familiar with in the 
context of sequential machines can be generalized in a similar manner 
to apply to discrete-time systems. In this section we will look in some 
detail at the generalized notion of a realization. As in other sections, 
our emphasis here will be toward those aspects of the theory that will be 
useful to us in our study of on-line diagnosis. We begin by stating Meyer
and Zeigler's definition of realization for sequential machines [ ].

Definition 3 

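If $M$ and $\tilde{M}$ are sequential machines then $M$ realizes $\tilde{M}$
(written $M \rho \tilde{M}$) if there is a triple of functions
$(\sigma_1, \sigma_2, \sigma_3)$ where $\sigma_1: (\tilde{I})^+ \to I^+$
is a semigroup homomorphism such that $\sigma_1(\tilde{I}) \subseteq I$,
$\sigma_2: \tilde{Q} \to Q$, $\sigma_3: Z' \to \tilde{Z}$ where $Z' \subseteq Z$,
such that for all $\tilde{q} \in \tilde{Q}$ and all $x \in (\tilde{I})^+$,

    $\tilde{\beta}_{\tilde{q}}(x) = \sigma_3(\beta_{\sigma_2(\tilde{q})}(\sigma_1(x)))$.

It has been shown by Leake [17] that this strictly behavioral
definition of realization is equivalent to the structurally oriented
definition of Hartmanis and Stearns [18].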

The following definition extends the above notion in a natural
manner to include discrete-time systems.

Definition 4

If $S$ and $\tilde{S}$ are two discrete-time systems then $S$ realizes
$\tilde{S}$ ($S \rho \tilde{S}$) if there is a triple of functions
$(\sigma_1, \sigma_2, \sigma_3)$ where $\sigma_1: (\tilde{I})^+ \to I^+$ is a
semigroup homomorphism such that $\sigma_1(\tilde{I}) \subseteq I$,
$\sigma_2: \tilde{Q} \to Q$, $\sigma_3: Z' \to \tilde{Z}$ where $Z' \subseteq Z$,
such that for all $\tilde{q} \in \tilde{Q}$, for all $t \in T$, and for all
$x \in (\tilde{I})^+$,

    $\tilde{\beta}_{\tilde{q}}(x, t) = \sigma_3(\beta_{\sigma_2(\tilde{q})}(\sigma_1(x), t))$.

If $S$ and $\tilde{S}$ are resettable systems our definition of realization is
somewhat different. Inherent in this definition is our presupposition
that a resettable system will be reset before every use.


Definition 5 

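If $S$ and $\tilde{S}$ are two resettable systems then $S$ realizes $\tilde{S}$
($S \rho \tilde{S}$) if there is a triple of functions
$(\sigma_1, \sigma_2, \sigma_3)$ where $\sigma_1: (\tilde{I})^+ \to I^+$ is a
semigroup homomorphism such that $\sigma_1(\tilde{I}) \subseteq I$,
$\sigma_2: \tilde{R} \to R$, $\sigma_3: Z' \to \tilde{Z}$ where $Z' \subseteq Z$,
such that for all $\tilde{r} \in \tilde{R}$, for all $t \in T$, and for all
$x \in (\tilde{I})^+$,

    $\tilde{\beta}_{\tilde{r},t}(x) = \sigma_3(\beta_{\sigma_2(\tilde{r}),t}(\sigma_1(x)))$.

In the case where $S$ and $\tilde{S}$ are time-invariant resettable systems
(i.e., resettable machines) all mention of time can be deleted from the
above definition.

Thus for each $\tilde{r} \in \tilde{R}$ and $t \in T$ the behavior of $S$ for
condition $(\sigma_2(\tilde{r}), t)$ is the same (modulo input encoding and
output decoding) as the behavior of $\tilde{S}$ for condition $(\tilde{r}, t)$.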
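
For resettable machines (the time-invariant case), the realization condition can be tested by brute force on all input sequences up to a chosen length. The sketch below is only a bounded check under stated assumptions, not a proof of realization: the names `behavior` and `realizes_up_to`, the bound `max_len`, and the two counters are all hypothetical, and the input encoding is given symbol by symbol.

```python
from itertools import product


def behavior(machine, r, x):
    """beta_r(x) for a time-invariant resettable machine (delta, lam, rho)."""
    delta, lam, rho = machine
    q = rho(r)
    for a in x[:-1]:
        q = delta(q, a)
    return lam(q, x[-1])


def realizes_up_to(realizing, realized, sigma1, sigma2, sigma3,
                   realized_resets, realized_inputs, max_len):
    """Check the realization condition on all input sequences of length 1..max_len."""
    for r in realized_resets:
        for n in range(1, max_len + 1):
            for x in product(realized_inputs, repeat=n):
                encoded = tuple(sigma1(a) for a in x)  # sigma1 applied symbol by symbol
                expected = behavior(realized, r, x)
                actual = sigma3(behavior(realizing, sigma2(r), encoded))
                if expected != actual:
                    return False
    return True


# Hypothetical machines: a modulo-4 counter and a modulo-2 counter, each
# reset to state 0 and emitting its current state as output.
mod4 = (lambda q, a: (q + 1) % 4, lambda q, a: q, lambda r: 0)
mod2 = (lambda q, a: (q + 1) % 2, lambda q, a: q, lambda r: 0)
identity = lambda v: v

ok = realizes_up_to(mod4, mod2, identity, identity, lambda z: z % 2, ["r"], [1], 8)
bad = realizes_up_to(mod4, mod2, identity, identity, identity, ["r"], [1], 8)
```

In this example the modulo-4 counter passes the check when the output is decoded modulo 2, and fails when the output decoding is the identity.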


Example 3 

Let $M_3$ and $\tilde{M}_3$ be the resettable machines shown in Figures
7 and 8.

Then $M_3 \rho \tilde{M}_3$ under the triple $(\sigma_1, \sigma_2, \sigma_3)$
where $\sigma_1: (\tilde{I}_3)^+ \to I_3^+$ is the identity,
$\sigma_2: \tilde{R}_3 \to R_3$ is defined by $\sigma_2(\ldots) = \ldots$, and
$\sigma_3: Z_3 \to \tilde{Z}_3$ is the identity. To verify this claim we
need only observe that
$\tilde{\beta}_{\tilde{r}}(x) = \beta_{\sigma_2(\tilde{r})}(x)$ for all
$\tilde{r} \in \tilde{R}_3$ and all $x \in (\tilde{I}_3)^+$.

Notice that the definition of realizes for resettable 
systems is less restrictive than that for discrete-time systems 
in the sense that where they are both resettable we only 
require the realizing system to mimic the behavior of the 
reset states of the realized system; while in the case where 
they are not resettable the realizing system must mimic
the behavior of every state of the realized system. On 
the other hand, the definition in the resettable case is more 
restrictive in the sense that for each reset state in the 
realized system not only does there exist a state in the 
realizing system which mimics its behavior, but we also 
know how to get to that state. 

For the special case of time-invariant resettable systems (i.e.,
resettable machines) the above remarks will be made more precise
in the following result which is analogous to the result due to Leake
that we have cited earlier. Let $M$ be a resettable machine. The
reachable part of $M$ is the set

    $\{p \in Q \mid p = \bar{\delta}(\rho(r), x)$ for some $r \in R$, $x \in I^*\}$.

A machine $M$ is $\ell$-reachable if any state in the reachable part of $M$
can be entered by a reset alone or by a reset followed by an input
sequence $x$ with $|x| \leq \ell$. Clearly, any resettable machine $M$ is
$(|Q| - 1)$-reachable.
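
The reachable part of a resettable machine, together with the smallest $\ell$ for which it is $\ell$-reachable, can be computed by a breadth-first search started from all reset states at once. The sketch below is an illustration under stated assumptions: the function name `reachable_depths` and the four-state machine are hypothetical and are not the machines of the examples in this study.

```python
from collections import deque


def reachable_depths(delta, rho, resets, inputs):
    """Map each reachable state to the length of a shortest input sequence
    that reaches it from some reset state (0 for the reset states themselves)."""
    depth = {rho(r): 0 for r in resets}
    queue = deque(depth)
    while queue:
        p = queue.popleft()
        for a in inputs:
            nxt = delta(p, a)
            if nxt not in depth:
                depth[nxt] = depth[p] + 1
                queue.append(nxt)
    return depth


# Hypothetical machine with states s0..s3; s2 cannot be reached from the reset state.
table = {("s0", 1): "s1", ("s1", 1): "s3", ("s3", 1): "s0", ("s2", 1): "s0"}
depth = reachable_depths(lambda q, a: table[(q, a)], lambda r: "s0", ["r1"], [1])
reachable_part = set(depth)
smallest_l = max(depth.values())
```

Because the search visits each state at most once, the largest depth can never exceed the number of states minus one, in line with the remark that every resettable machine is $(|Q| - 1)$-reachable.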


Example 4 

The reachable part of (see Example 3) is { s , s„, s,}. M 

o 0 ^ u 3 

is 2 -reachable since Pg(rj) = s^, 6 g(pg(r^),l) = s^, and 6 g(pg(rj), 11 ) = 83 . 


Theorem 1 

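Let $M$ and $\tilde{M}$ be two resettable machines. Let $P$ and $\tilde{P}$ be the
reachable parts of $M$ and $\tilde{M}$. Then $M$ realizes $\tilde{M}$ if and only if
there exists a 4-tuple of functions $(\eta_1, \eta_2, \eta_3, \eta_4)$ where

    $\eta_1: \tilde{I} \to I$
    $\eta_2: \tilde{P} \to \mathcal{P}(P) - \emptyset$
    $\eta_3: Z' \to \tilde{Z}$, where $Z' \subseteq Z$
    $\eta_4: \tilde{R} \to R$

such that

    i)   $\delta(p, \eta_1(a)) \in \eta_2(\tilde{\delta}(\tilde{p}, a))$ for all $\tilde{p} \in \tilde{P}$, $a \in \tilde{I}$, and $p \in \eta_2(\tilde{p})$
    ii)  $\eta_3(\lambda(p, \eta_1(a))) = \tilde{\lambda}(\tilde{p}, a)$ for all $\tilde{p} \in \tilde{P}$, $a \in \tilde{I}$, and $p \in \eta_2(\tilde{p})$
    iii) $\rho(\eta_4(\tilde{r})) \in \eta_2(\tilde{\rho}(\tilde{r}))$ for all $\tilde{r} \in \tilde{R}$.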


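Proof: (Necessity) Assume $M \rho \tilde{M}$. Then there exists an appropriate
triple of functions $(\sigma_1, \sigma_2, \sigma_3)$ such that
$\tilde{\beta}_{\tilde{r}}(x) = \sigma_3(\beta_{\sigma_2(\tilde{r})}(\sigma_1(x)))$.
Therefore

    $\tilde{\beta}_{\tilde{r}}(uv) = \sigma_3(\beta_{\sigma_2(\tilde{r})}(\sigma_1(uv)))$

for each $\tilde{r} \in \tilde{R}$, $u \in \tilde{I}^*$, and $v \in \tilde{I}^+$.

Hence,

    $\tilde{\beta}_{\bar{\tilde{\delta}}(\tilde{\rho}(\tilde{r}), u)}(v) = \sigma_3(\beta_{\bar{\delta}(\rho(\sigma_2(\tilde{r})), \sigma_1(u))}(\sigma_1(v)))$.

Thus for each $\tilde{p} \in \tilde{P}$ there is a $p \in P$ such that

    $\tilde{\beta}_{\tilde{p}}(v) = \sigma_3(\beta_p(\sigma_1(v)))$ for all $v \in \tilde{I}^+$.

Consider $\eta_2: \tilde{P} \to \mathcal{P}(P) - \emptyset$ defined by

    $\eta_2(\tilde{p}) = \{p \in P \mid \sigma_3(\beta_p(\sigma_1(v))) = \tilde{\beta}_{\tilde{p}}(v)$ for all $v \in \tilde{I}^+\}$

and consider $\eta_1: \tilde{I} \to I$ defined by

    $\eta_1(a) = \sigma_1(a)$.

Claim: The 4-tuple $(\eta_1, \eta_2, \sigma_3, \sigma_2)$ satisfies i), ii), and iii).

i) Let $p \in \eta_2(\tilde{p})$. We must show
$\delta(p, \eta_1(a)) \in \eta_2(\tilde{\delta}(\tilde{p}, a))$. For all $x \in \tilde{I}^+$,

    $\tilde{\beta}_{\tilde{\delta}(\tilde{p}, a)}(x) = \tilde{\beta}_{\tilde{p}}(ax)$
        $= \sigma_3(\beta_p(\sigma_1(ax)))$
        $= \sigma_3(\beta_{\delta(p, \eta_1(a))}(\sigma_1(x)))$,

where the last step uses $\sigma_1(ax) = \eta_1(a)\sigma_1(x)$.
Hence, $\delta(p, \eta_1(a)) \in \eta_2(\tilde{\delta}(\tilde{p}, a))$.

ii) Let $p \in \eta_2(\tilde{p})$. We must show
$\sigma_3(\lambda(p, \eta_1(a))) = \tilde{\lambda}(\tilde{p}, a)$.

    $\tilde{\lambda}(\tilde{p}, a) = \tilde{\beta}_{\tilde{p}}(a)$
        $= \sigma_3(\beta_p(\eta_1(a)))$
        $= \sigma_3(\lambda(p, \eta_1(a)))$.

iii) Let $\tilde{r} \in \tilde{R}$. We must show
$\rho(\sigma_2(\tilde{r})) \in \eta_2(\tilde{\rho}(\tilde{r}))$.
$\tilde{\beta}_{\tilde{r}}(x) = \sigma_3(\beta_{\sigma_2(\tilde{r})}(\sigma_1(x)))$ implies that
$\rho(\sigma_2(\tilde{r})) \in \eta_2(\tilde{\rho}(\tilde{r}))$.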

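(Sufficiency) Suppose there exist functions $(\eta_1, \eta_2, \eta_3, \eta_4)$ as in the
statement of the theorem. Let $\sigma_1: (\tilde{I})^+ \to I^+$ be the natural extension
of $\eta_1$ to sequences, i.e., $\sigma_1(a_1 \ldots a_n) = \eta_1(a_1) \ldots \eta_1(a_n)$.

Claim: $M \rho \tilde{M}$ under $(\sigma_1, \eta_4, \eta_3)$.

Consider $\zeta: \tilde{P} \to P$ where

    $\zeta(\tilde{p})$ = some $p \in \eta_2(\tilde{p})$ such that
    $\rho(\eta_4(\tilde{r})) = \zeta(\tilde{\rho}(\tilde{r}))$.

Let $x = ya$ where $a \in \tilde{I}$. Then

    $\eta_3(\beta_{\eta_4(\tilde{r})}(\sigma_1(x)))$
        $= \eta_3(\lambda(\bar{\delta}(\rho(\eta_4(\tilde{r})), \sigma_1(y)), \sigma_1(a)))$
        $= \eta_3(\lambda(p, \sigma_1(a)))$ where $p \in \eta_2(\bar{\tilde{\delta}}(\tilde{\rho}(\tilde{r}), y))$
        $= \tilde{\lambda}(\bar{\tilde{\delta}}(\tilde{\rho}(\tilde{r}), y), a)$
        $= \tilde{\beta}_{\tilde{\rho}(\tilde{r})}(ya)$
        $= \tilde{\beta}_{\tilde{r}}(x)$.

This completes the proof of Theorem 1.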

In this study we will not be concerned with the more general
theoretical aspects of realizations. What we desire from realizations
is the following. Given a resettable system $\tilde{S}$ we will want to find a
resettable system $S$ such that $S$ can do everything that $\tilde{S}$ can and $S$
has the on-line diagnosis properties that are needed. Generally
we will think of $S$ as having two sets of output terminals: one which
is used in place of the output terminals of $\tilde{S}$ and the other which is
used solely for diagnosis.

To formalize this notion of a system having more than one set
of output terminals we introduce the notion of a structured set. As
defined by Zeigler [19], a set $K$ is structured by injecting it into
a cross product of an indexed family $\{K_i \mid i \in N\}$. In what follows we
will take $N$ to be a finite ordered set such as the first $n$ integers.
Thus a structure assignment is a one-one map from $K$ into
$K_1 \times K_2 \times \cdots \times K_n$. Normally we do not mention this map
explicitly but will consider $K$ (once structured) as a subset of
$K_1 \times \cdots \times K_n$. Given a structured set $K$, we have a family of
coordinate projections $\{P_j \mid j \in N\}$ where $P_j: K \to K_j$ is
defined by

    $P_j(k_1, \ldots, k_j, \ldots, k_n) = k_j$.

With these notions in mind the special type of realization which will
be used in our theory of on-line diagnosis can be presented.

Definition 6

Let $S$ and $\tilde{S}$ be two resettable systems with $Z$ structured so
that $Z \subseteq Z_1 \times Z_2$. Then $S$ d-realizes $\tilde{S}$
($S \rho_d \tilde{S}$) if $S \rho \tilde{S}$ under the triple of functions
$(\sigma_1, \sigma_2, \sigma_3)$ where $\sigma_3 = \sigma_3' \circ P_1$ for some
function $\sigma_3'$ from (a subset of) $Z_1$ into $\tilde{Z}$.

I.e., $S \rho_d \tilde{S}$ if $S \rho \tilde{S}$ and the output decoding is
independent of the second coordinate of $Z$. In this case $Z_1$ is called
the principal output and is given the more mnemonic name $Z_P$, and $Z_2$
is called the augmented output and is given the name $Z_A$. Thus
$Z \subseteq Z_P \times Z_A$.

Given that $S \rho_d \tilde{S}$ we can define two new functions associated with
$\beta_{r,t}$, the behavior of $S$ for condition $(r, t)$. The first one will be the
behavior function of $S$ with respect to the output terminals which are
used to mimic $\tilde{S}$ and the second will be the behavior function of $S$ with
respect to the output terminals which are used solely for diagnosis.
More precisely, the principal behavior of $S$ for condition $(r, t)$ is the
function

    $\gamma_{r,t}: I^+ \to Z_P$

where

    $\gamma_{r,t}(x) = P_1(\beta_{r,t}(x))$ for each $x \in I^+$

or more compactly,

    $\gamma_{r,t} = P_1 \circ \beta_{r,t}$.

The augmented behavior of $S$ for condition $(r, t)$ is the function

    $\alpha_{r,t}: I^+ \to Z_A$

where

    $\alpha_{r,t} = P_2 \circ \beta_{r,t}$.

Thus $\beta_{r,t}(x) = (\gamma_{r,t}(x), \alpha_{r,t}(x))$ for all $x \in I^+$. We now
extend these functions in a natural way. For $r \in R$ and $t \in T$ let

    $\hat{\beta}_{r,t}: I^+ \to Z^+$

where for all $a_1 \ldots a_n \in I^+$

    $\hat{\beta}_{r,t}(a_1 \ldots a_n) = \beta_{r,t}(a_1) \beta_{r,t}(a_1 a_2) \ldots \beta_{r,t}(a_1 \ldots a_n)$.

Likewise let $\hat{\gamma}_{r,t}$ and $\hat{\alpha}_{r,t}$ denote the natural
extensions of $\gamma_{r,t}$ and $\alpha_{r,t}$ respectively.

4. Resettable Systems with Faults

Our model of a "resettable system with faults" is a
specialization of Meyer's general model of a "system with faults" [20].

     Informally, a "system with faults" is a system,
     along with a set of potential faults of the system and
     description of what happens to the original system as
     the result of each fault. The original system and the
     systems resulting from faults are members of one of
     two prescribed classes of (formal) systems, a
     "specification" class for the original system and a
     "realization" class for the resulting systems. More
     precisely, we say that a triple $(\mathcal{S}, \mathcal{R}, \rho)$ is a
     (system) representation scheme if

       i) $\mathcal{S}$ is a class of systems, the specification
          class,

       ii) $\mathcal{R}$ is a class of systems, the realization
          class,

       iii) $\rho: \mathcal{R} \to \mathcal{S}$ where, if $R \in \mathcal{R}$, $R$ realizes
          $\rho(R)$.

     By a class of systems, in this context, we mean a class
     of formal systems, i.e. a set of formally specified
     structures of the same type, each having an associated
     behavior that is determined by the structure [20].

In this study we are concerned with the reliable use of a
system. I.e., we are concerned with degradations in structure
which Meyer calls "life defects". This is contrasted with reliable
design in which case we would be concerned with "birth defects".
Thus, in our case, a specification is a realization and we choose
a representation scheme $(\mathcal{R}, \mathcal{R}, \rho)$ where $\rho$ is the identity
function on $\mathcal{R}$.

Assuming that a faulty resettable system has the same
input, output, and reset alphabets as the fault-free system $S$,
the following class of resettable systems will suffice as a
realization class:

    $\mathcal{S}(I, Z, R) = \{S' \mid S' = (I, Q', Z, \delta', \lambda', R, \rho')\}$.

In summary, the representation scheme that we are choosing for
our study of on-line diagnosis is the scheme $(\mathcal{R}, \mathcal{R}, \rho)$ where
$\mathcal{R} = \mathcal{S}(I, Z, R)$ and $\rho$ is the identity function on $\mathcal{R}$.

In such a scheme the seemingly difficult problem of describing
faults and their results becomes relatively straightforward. Before
we state our particular notion of a fault and its results we will
repeat here Meyer's general notion of a "system with faults" [20].

     A system with faults in a representation scheme
     $(\mathcal{S}, \mathcal{R}, \rho)$ is a structure $(S, F, \phi)$ where

       i) $S \in \mathcal{S}$

       ii) $F$ is a set, the faults of $S$

       iii) $\phi: F \to \mathcal{R}$ such that, for some $f \in F$,
          $\rho(\phi(f)) = S$.

     If $f \in F$, the system $S^f = \phi(f)$ is the result of $f$. If
     $\rho(S^f) = S$ then $f$ is improper (by iii), $F$ contains at
     least one improper fault); otherwise it is proper. A
     realization $S^f$ is fault-free if $f$ is improper; otherwise
     $S^f$ is faulty [20].

In applying this notion to our study we must first define what
we mean by a fault of a resettable system. Given a resettable system
$S \in \mathcal{S}(I, Z, R)$, a fault $f$ of $S$ can be regarded as a transformation of
$S$ into another system $S' \in \mathcal{S}(I, Z, R)$ at some time $\tau$. Accordingly,
the resulting faulty system looks like $S$ up to time $\tau$ and like $S'$
thereafter. Since $S$ may be in operation at time $\tau$ we must also be
concerned with the question of what happens to the state of $S$ as
this transformation takes place. We handle this with a function
$\theta$ from the state set of $S$ to that of $S'$. The interpretation of $\theta$ is that
if $S$ is in state $q$ immediately before time $\tau$ then $S'$ is in state $\theta(q)$ at
time $\tau$. More precisely,

Definition 7 

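If $S \in \mathcal{S}(I, Z, R)$, a fault of $S$ is a triple

    $f = (S', \tau, \theta)$

where $S' \in \mathcal{S}(I, Z, R)$, $\tau \in T$, and $\theta: Q \to Q'$.

Given this formal representation of a fault of $S$, the resulting
faulty system is defined as follows.

Definition 8

The result of $f = (S', \tau, \theta)$ is the system

    $S^f = (I, Q^f, Z, \delta^f, \lambda^f, R, \rho^f)$

where $Q^f = Q \cup Q'$,

    $\delta^f(q, a, t) = \begin{cases}
        \delta(q, a, t)         & \text{if } q \in Q \text{ and } t < \tau - 1 \\
        \theta(\delta(q, a, t)) & \text{if } q \in Q \text{ and } t = \tau - 1 \\
        \delta'(q, a, t)        & \text{if } q \in Q' \text{ and } t \geq \tau
    \end{cases}$

    $\lambda^f(q, a, t) = \begin{cases}
        \lambda(q, a, t)  & \text{if } q \in Q \text{ and } t < \tau \\
        \lambda'(q, a, t) & \text{if } q \in Q' \text{ and } t \geq \tau
    \end{cases}$

    $\rho^f(r, t) = \begin{cases}
        \rho(r, t)         & \text{if } t < \tau \\
        \theta(\rho(r, t)) & \text{if } t = \tau \\
        \rho'(r, t)        & \text{if } t > \tau.
    \end{cases}$

(Arguments not specified in the above definitions may be assigned
arbitrary values.)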
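
The construction of the result of a fault can be sketched in Python by dispatching on the time argument. This is a minimal illustration under stated assumptions: systems are represented only by their triples of functions (transition, output, reset), the name `result_of_fault` and the example systems are hypothetical, and the boundary case for the faulty transition and output functions is read as $t \geq \tau$. Arguments outside the cases of Definition 8 simply fall through to one of the branches, which is permitted since they may take arbitrary values.

```python
def result_of_fault(system, fault):
    """Return (delta_f, lam_f, rho_f) for fault = (faulty_system, tau, theta)."""
    delta, lam, rho = system
    (delta_p, lam_p, rho_p), tau, theta = fault

    def delta_f(q, a, t):
        if t < tau - 1:
            return delta(q, a, t)
        if t == tau - 1:
            return theta(delta(q, a, t))  # state is carried over into the faulty system
        return delta_p(q, a, t)

    def lam_f(q, a, t):
        return lam(q, a, t) if t < tau else lam_p(q, a, t)

    def rho_f(r, t):
        if t < tau:
            return rho(r, t)
        if t == tau:
            return theta(rho(r, t))
        return rho_p(r, t)

    return delta_f, lam_f, rho_f


# Hypothetical example: a two-state toggle whose output becomes stuck at 1 at time 5.
fault_free = (lambda q, a, t: 1 - q, lambda q, a, t: q, lambda r, t: 0)
stuck = (lambda q, a, t: 1 - q, lambda q, a, t: 1, lambda r, t: 0)
delta_f, lam_f, rho_f = result_of_fault(fault_free, (stuck, 5, lambda q: q))
```

With the identity as state map, the faulty system in this example keeps the toggle's state sequence and differs from the fault-free system only in its outputs from time 5 onward.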

In justifying this representation of the resulting faulty system
one should regard a fault $f = (S', \tau, \theta)$ as actually occurring between
time $\tau - 1$ and $\tau$. Note that, for any fault $f$ of $S$,
$S^f \in \mathcal{S}(I, Z, R)$.

Example 5

Recall that in Example 2 $M_1$ was transformed into $M_1'$ at time
100. We would say now that $f = (M_1', 100, e)$, where $e$ is the
identity function, is a fault of $M_1$ and that $S$ is the result of
$f$ (i.e., $S = M_1^f$).

Example 6

Again consider $M_1$ as implemented by the circuit in Figure 2
and let $g$ be the fault which is caused by d^ becoming stuck-at-1 at
time 50. Then $g = (M_1'', 50, \theta)$ where $M_1''$ is as indicated in Figure 9 and
$\theta: Q_1 \to Q_1''$ is defined as follows:

Figure 9 Resettable Machine $M_1''$

$M_1^g$ will behave as $M_1$ up to time 50 and thereafter it will produce
a constant sequence of 1's.

To complete the model, a resettable system with faults, in this
representation scheme, is a structure

    $(S, F, \phi)$

where $S \in \mathcal{S}(I, Z, R)$, $F$ is a set of faults of $S$ including at least one
improper fault (e.g., $f = (S, 0, e)$ where $e$ is the identity function),
and $\phi: F \to \mathcal{S}(I, Z, R)$ where $\phi(f) = S^f$, for all $f \in F$. Given this
definition, we can drop the explicit reference to $\phi$ in denoting a
resettable system with faults, i.e., $(S, F)$ will mean $(S, F, \phi)$ where
$\phi$ is as defined above.

In the remainder of this study we will be dealing almost exclusively
with resettable systems. Thus we will refer to resettable systems
simply as systems and to resettable machines as machines.

A word is in order about our definition of faults. The
interpretation here is one of effect, not cause; e.g., we don't
talk of stuck-at-1 OR gates but rather of the system which is created
due to some presumed physical cause. We will refer to these physical
causes as component failures or simply as failures. A fault, by our
definition, consists of precisely that information which is needed to
define the system which results from the fault. This allows us to treat
faults in the abstract, independent of specific network realizations of
the system and without reference to the technology employed in this
realization and the types of failures which are possible with this
technology. We are assured, however, that for each fault we have enough
information to assess the structural and behavioral effects of the fault,
in particular as these effects relate to fault diagnosis and tolerance.

There are limits, however, to how much can be done with a
purely effect-oriented concept of faults. When a system is sufficiently
structured to allow a reasonable notion of what may cause a fault we
certainly will want to make use of this notion. When this is the case
we may, through an abuse of language, refer to a specific failure at
time $\tau$ as a fault. What we will mean is that we have stated a cause
of a fault and that there is a unique fault which is the result of this
failure at time $\tau$.

It is interesting to see what the scope of our definition of fault
is in terms of the types of failures which will result in faults. Recall
that a fault $f$ of a system $S$ is a triple, $f = (S', \tau, \theta)$, where $S' \in \mathcal{S}(I, Z, R)$.
Thus $S'$ is a (resettable) system with the same input, output, and reset
alphabets as $S$. The previous sentence contains, implicitly, almost
every restriction that we have put on faults. First of all, $S'$ is a
(resettable) system. Thus it remains within our universe of discourse.
In particular, its reset inputs still act like reset inputs, i.e., they
cause $S'$ to go into a particular state regardless of the state it was in
when the reset input was applied. The restrictions on the input, output,
and reset alphabets are reasonable since after a fault occurs the system
presumably will have the same input and output terminals as it had
before the fault occurred.

Since a fault $f$ is a triple $(S', \tau, \theta)$ with $S'$ a
(time-varying) system, we have considerable latitude in the types
of causes of faults which we may consider. In particular, we may
consider simultaneous permanent failures in one or more components,
simultaneous intermittent failures in one or more components, or any
combination of the above occurring at the same or varying times. For
example, a fault $f$ may be caused by an AND gate becoming stuck-at-1
at time $\tau_1$, followed by an OR gate becoming stuck-at-0 at time $\tau_2$.
Our main interest will be the case where the fault is caused by the
failure of only one component, since usually such a failure will be
diagnosed before a second failure occurs. In the case where a fault
of a machine $M$ is caused by a permanent failure of one or more
components at only one time, $f$ will be of the form $(M', \tau, \theta)$.

Let us now compute the behavior of $S^f$ in state $q$. Let $x = a_1 \ldots a_n \in I^+$.
Then

$$\beta^f_q(x,t) = \lambda^f(q,x,t) = \lambda^f(\delta^f(q, a_1 \ldots a_{n-1}, t), a_n, t+n-1).$$

There are three cases which must be considered.

Case i) $q \in Q$ and $t+n-1 < \tau$. Then

$$\beta^f_q(x,t) = \lambda(\delta(q, a_1 \ldots a_{n-1}, t), a_n, t+n-1) = \beta_q(x,t).$$

Case ii) $q \in Q$, $t+n-1 \ge \tau$, and $t < \tau$. Say $t+n-m = \tau$. Then

$$\beta^f_q(x,t) = \lambda'(\delta'(\theta(\delta(q, a_1 \ldots a_{n-m}, t)), a_{n-m+1} \ldots a_{n-1}, t+n-m), a_n, t+n-1)$$
$$= \beta'_{\theta(\delta(q, a_1 \ldots a_{n-m}, t))}(a_{n-m+1} \ldots a_n, \tau) = \beta'_{\theta(\delta(q,y,t))}(z, \tau),$$

where $y = a_1 \ldots a_{n-m}$ and $z = a_{n-m+1} \ldots a_n$.

Case iii) $q \in Q'$ and $t \ge \tau$. Then

$$\beta^f_q(x,t) = \lambda'(\delta'(q, a_1 \ldots a_{n-1}, t), a_n, t+n-1) = \beta'_q(x,t).$$

Thus we have proved:

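Theorem 2

Let $S$ be a system and $f = (S', \tau, \theta)$ a fault of $S$. Then for each
$t \in T$ and $x \in I^+$

$$\beta^f_q(x,t) = \begin{cases}
\beta_q(x,t) & \text{if } q \in Q \text{ and } t + |x| \le \tau \\
\beta'_{\theta(\delta(q,y,t))}(z,\tau) & \text{if } q \in Q,\ t + |x| > \tau, \text{ and } t < \tau, \text{ where } x = yz \text{ and } |y| = \tau - t \\
\beta'_q(x,t) & \text{if } q \in Q' \text{ and } t \ge \tau.
\end{cases}$$

(As in the definitions of $\delta^f$ and $\lambda^f$, arguments not specified may be assigned
arbitrary values.)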
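
The three cases of Theorem 2 can be checked mechanically on a small example. The following Python sketch is only an illustration under stated assumptions: the two systems, the map `theta`, and the occurrence time `TAU` are hypothetical, and the step-by-step simulator assumes the convention that the state is passed through $\theta$ at time $\tau$, before the input arriving at time $\tau$ is processed. It compares a direct simulation of the resulting system with the case formula.

```python
from itertools import product

def run_state(delta, q, x, t):
    """State reached from q after reading x, whose first symbol arrives at time t."""
    for k, a in enumerate(x):
        q = delta(q, a, t + k)
    return q

def beta(delta, lam, q, x, t):
    """Behavior: the last output symbol for nonempty input x started in q at time t."""
    q = run_state(delta, q, x[:-1], t)
    return lam(q, x[-1], t + len(x) - 1)

def beta_f_direct(S, S1, tau, theta, q, x, t):
    """Step the resulting system one input at a time.
    Assumed convention: the state is mapped through theta at time tau."""
    (delta, lam), (delta1, lam1) = S, S1
    faulty = t >= tau                 # for t >= tau, q is taken to be a state of S'
    for k, a in enumerate(x):
        if not faulty and t + k == tau:
            q, faulty = theta(q), True
        d, l = (delta1, lam1) if faulty else (delta, lam)
        out, q = l(q, a, t + k), d(q, a, t + k)
    return out

def beta_f_theorem2(S, S1, tau, theta, q, x, t):
    """The three cases of Theorem 2."""
    (delta, lam), (delta1, lam1) = S, S1
    if t < tau and t + len(x) <= tau:                       # case i
        return beta(delta, lam, q, x, t)
    if t < tau:                                             # case ii, |y| = tau - t
        y, z = x[:tau - t], x[tau - t:]
        return beta(delta1, lam1, theta(run_state(delta, q, y, t)), z, tau)
    return beta(delta1, lam1, q, x, t)                      # case iii

# Hypothetical systems over I = Q = Z = {0, 1}
S = (lambda q, a, t: q ^ a, lambda q, a, t: q ^ a)               # fault-free system
S1 = (lambda q, a, t: q ^ a ^ (t % 2), lambda q, a, t: 1)        # time-varying S'
theta = lambda q: 1 - q
TAU = 2

agree = all(
    beta_f_direct(S, S1, TAU, theta, q, x, t) == beta_f_theorem2(S, S1, TAU, theta, q, x, t)
    for t in range(5) for q in (0, 1)
    for n in range(1, 5) for x in product((0, 1), repeat=n)
)
print(agree)
```

The check is exhaustive only over inputs of length at most 4 and start times 0 through 4, so it illustrates the theorem rather than proving it.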

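Corollary 2.1

Let $S$ be a system and $f = (S', \tau, \theta)$ a fault of $S$. Then for each
$r \in R$, $t \in T$, and $x \in I^+$

$$\beta^f_{r,t}(x) = \begin{cases}
\beta_{r,t}(x) & \text{if } t + |x| \le \tau \\
\beta'_{\theta(\delta(\rho(r,t),y,t))}(z,\tau) & \text{if } t + |x| > \tau \text{ and } t \le \tau, \text{ where } x = yz \text{ and } |y| = \tau - t \\
\beta'_{r,t}(x) & \text{if } t > \tau.
\end{cases}$$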
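Proof: By its definition $\beta^f_{r,t}(x) = \beta^f_{\rho^f(r,t)}(x,t)$.
Again we have three cases to consider.

Case i) $t + |x| \le \tau$. Then $t < \tau$ and $\rho^f(r,t) = \rho(r,t) \in Q$.
Therefore by Theorem 2

$$\beta^f_{r,t}(x) = \beta^f_{\rho(r,t)}(x,t) = \beta_{\rho(r,t)}(x,t) = \beta_{r,t}(x).$$

Case ii) $t + |x| > \tau$ and $t \le \tau$. If $t < \tau$ then $\rho^f(r,t) = \rho(r,t) \in Q$
and case ii) of Theorem 2 applies with $\rho(r,t)$ in
place of $q$. If $t = \tau$ then $\rho^f(r,t) = \theta(\rho(r,t)) \in Q'$ and case
iii) of the theorem applies, giving us

$$\beta^f_{r,t}(x) = \beta'_{\theta(\rho(r,t))}(x,t) = \beta'_{\theta(\delta(\rho(r,t),\Lambda,t))}(x,\tau).$$

Case iii) $t > \tau$. In this case $\rho^f(r,t) = \rho'(r,t) \in Q'$. Therefore

$$\beta^f_{r,t}(x) = \beta^f_{\rho'(r,t)}(x,t) = \beta'_{\rho'(r,t)}(x,t) = \beta'_{r,t}(x).$$
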


We have noted that we will often be interested in the physical
cause of a fault. For example, in a network realization of a machine
we may be interested in faults which are caused by a specific NAND
gate becoming stuck-at-1. Since this gate failure results in different
faults as we consider it occurring at different times, it seems natural
to give a name to this family of faults. More generally, we will define
an equivalence relation on a set of faults such that a family of faults
such as we have just mentioned will be an equivalence class.

Definition 9

Let $F$ be a set of faults of a system $S$ and let $f_1 = (S_1, \tau_1, \theta_1)$
and $f_2 = (S_2, \tau_2, \theta_2)$ be in $F$. Then $f_1$ is equivalent to $f_2$ ($f_1 \equiv f_2$) if
$S_1$ and $S_2$ are such that

i) $Q_1 = Q_2$
ii) $\delta_1(q, a, t + \tau_1) = \delta_2(q, a, t + \tau_2)$ for all $q \in Q$, $a \in I$, and $t \in T$
iii) $\lambda_1(q, a, t + \tau_1) = \lambda_2(q, a, t + \tau_2)$ for all $q \in Q$, $a \in I$, and $t \in T$
iv) $\rho_1(r, t + \tau_1) = \rho_2(r, t + \tau_2)$ for all $r \in R$ and $t \in T$

and if $\theta_1 = \theta_2$.

We can think of equivalent faults as being time-translations of 
one another. 

Theorem 3 

The above relation is an equivalence relation. 

Proof: It is clearly reflexive, symmetric, and transitive because
$=$ has these properties and because the quantifiers, for all $q \in Q$ etc.,
are independent of the particular fault.

We denote the equivalence class of $F$ which contains the
fault $f$ by $[f]_F$. When the class of faults is clear we will drop the $F$.
Generally if $F$ is not mentioned we take it to be the set of all possible
faults of a system $S$. We let $f_i = (S_i, i, \theta)$ denote the fault in $[f]$ which
occurs at time $i$. When dealing with behaviors, $\beta^{f_i}$ will denote the
behavior of $S^{f_i}$ and $\beta^i$ will denote the behavior of $S_i$.

From the definition we can see that if $f = (M', \tau, \theta)$ where $M'$ is
a machine then $[f] = \{(M', t, \theta) \mid t \in T\}$.

Let $f$ be a fault of a machine $M$. It is clear from Definition 9
that $f_i \equiv f_j$ implies that $\beta^i_q(x, t+i) = \beta^j_q(x, t+j)$ for all $t \in T$. Likewise,
$\beta^i_{r,t+i}(x) = \beta^j_{r,t+j}(x)$.
Since $M$ is time-invariant it is a direct consequence of Theorem 2 and
the above observation that there is a similar relation between the
behaviors of $M^{f_i}$ and $M^{f_j}$. More precisely,

Theorem 4

Let $f$ be a fault of $M$ and let $f_i, f_j \in [f]$. Then for all $q \in Q$, $x \in I^+$,
$r \in R$, and $t \in T$

$$\beta^{f_i}_q(x, t+i) = \beta^{f_j}_q(x, t+j)$$

and

$$\beta^{f_i}_{r,t+i}(x) = \beta^{f_j}_{r,t+j}(x).$$

5. Fault Tolerance and Errors

Given a system with faults $(S, F)$ and a proper fault $f \in F$, an
immediate question is whether the faulty system is usable in the
sense that its behavior resembles, within acceptable limits, that of
the fault-free system $S$. We will use the general notion of a "tolerance
relation" [20] to make more precise what is meant by "acceptable
limits." A tolerance relation for a representation scheme $(\mathcal{R}, \mathcal{S}, \rho)$
is a relation $\tau$ between $\mathcal{R}$ and $\mathcal{S}$ ($\tau \subseteq \mathcal{R} \times \mathcal{S}$) such that, for
all $R \in \mathcal{R}$, $(R, \rho(R)) \in \tau$ (i.e., $\rho \subseteq \tau$). In this section we will
develop the particular notions of "acceptable limits" that we will be
using in this study of on-line diagnosis.

At this point in our development we will assume that we are
given two systems $\tilde{S}$ and $S$ where $\tilde{S}$ realizes $S$. Thus the principal and
augmented behaviors of $\tilde{S}$ will be defined. More generally, assume that
we are given any system $S$ with structured output $Z \subseteq Z_P \times Z_A$. Such
a system will be called an output-augmented system. Clearly the
definitions of principal and augmented behaviors apply to
output-augmented systems.

If $f = (S', \tau, \theta)$ is a fault of $S$ then since the output alphabet of $S^f$
is the same as that of $S$ it can be given the same structure, and
henceforth we will always assume that this has been done. Accordingly, we
can compare the principal and augmented behaviors of $S^f$ with those
of $S$.

Note that any system $S$ can be considered as an output-augmented
system by considering $Z$ to be $Z \times \{0\}$. Given a system $S$ with
unstructured output alphabet $Z$ we will assume this trivial augmentation
structure. In this case the principal behavior of $S$ will be identical
to the behavior of $S$.

Definition 10

Let $f$ be a fault of a system $S$. Then $f$ is tolerated by $S$ for resets
at time $t$ if

$$\beta^f_{r,t}(x) = \beta_{r,t}(x) \quad \text{for each } r \in R \text{ and } x \in I^+.$$

In the special case where $f$ is tolerated by $S$ for resets at time 0 we
will simply say $f$ is tolerated by $S$.

Note that this is a very refined notion of fault tolerance. A 
coarser notion, and one more in keeping with the literature, would be 
behavioral equivalence for resets at any time. We prefer our finer 
definition for with it the effects of time can be more naturally analyzed. 
One question which we will study later is: For resets at how many 
(and which) times must a fault be tolerated for it to be tolerated for 
resets at any time? 

Theorem 5 

Let $f = (S', \tau, \theta)$ be a fault of machine $M$. Then $f$ is tolerated by
$M$ for resets at time $t$ if and only if $f_{\tau - t}$ is tolerated by $M$.

Proof: $f_{\tau-t}$ is tolerated by $M$
$\iff \beta^{f_{\tau-t}}_{r,0}(x) = \beta_{r,0}(x)$ for each $r \in R$ and $x \in I^+$
$\iff \beta^{f}_{r,t}(x) = \beta_{r,t}(x)$ for each $r \in R$ and $x \in I^+$
$\iff$ $f$ is tolerated by $M$ for resets at time $t$.

The second implication follows from Theorem 4 and the hypothesis
that $M$ is a machine (i.e., a time-invariant system).

Thus, $f_i, f_j, f_k, \ldots$ are tolerated by $M$ for resets at times $t_1, t_2, t_3, \ldots$
respectively if and only if $\{f_{i-t_1}, f_{j-t_2}, f_{k-t_3}, \ldots\}$ is tolerated by $M$,
where by $F$ is tolerated by $M$ we mean that each $f \in F$ is tolerated by
$M$. Due to this we will always consider resets to be released at time
0 when dealing with fault tolerance of machines and no generality will
be lost. Clearly, due to Theorem 4 we can do this same sort of thing
for any other behavioral attribute.
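
Definition 10 and Theorem 5 can be explored with a bounded search. The sketch below is hedged in several ways: the saturating counter, its faulty version, and the helper names `outputs` and `tolerated` are hypothetical; only inputs up to a fixed length are tried, so a result of `True` means "no difference found within the bound"; and the reset handling (reset released before, at, or after the fault) follows the three cases used in the proof of Corollary 2.1.

```python
from itertools import product

def outputs(M, M1, theta, tau, r, x, t):
    """Output sequence of M^f for reset r released at time t (tau=None: fault-free)."""
    (delta, lam, rho), (delta1, lam1, rho1) = M, M1
    if tau is None or t < tau:
        q, faulty = rho(r), False
    elif t == tau:
        q, faulty = theta(rho(r)), True      # reset released as the fault occurs
    else:
        q, faulty = rho1(r), True            # reset released after the fault
    out = []
    for k, a in enumerate(x):
        if not faulty and tau is not None and t + k == tau:
            q, faulty = theta(q), True
        d, l = (delta1, lam1) if faulty else (delta, lam)
        out.append(l(q, a))
        q = d(q, a)
    return out

def tolerated(M, M1, theta, tau, t, resets, alphabet, max_len):
    """Bounded form of Definition 10: only inputs of length <= max_len are tried."""
    return all(
        outputs(M, M1, theta, tau, r, x, t) == outputs(M, M1, theta, None, r, x, t)
        for r in resets
        for n in range(1, max_len + 1)
        for x in product(alphabet, repeat=n)
    )

# Hypothetical machine: a counter saturating at 2 that outputs 1 once it reaches 2.
M = (lambda q, a: min(q + 1, 2), lambda q, a: int(q == 2), lambda r: 0)
# Hypothetical failure: the output logic now fires as soon as the count is >= 1.
M1 = (lambda q, a: min(q + 1, 2), lambda q, a: int(q >= 1), lambda r: 0)
theta = lambda q: q
args = dict(resets=[0], alphabet=["a"], max_len=6)

print(tolerated(M, M1, theta, 2, 0, **args))   # fault two steps after the reset
print(tolerated(M, M1, theta, 1, 0, **args))   # fault one step after the reset
# Theorem 5: tolerance for resets at time t depends only on tau - t.
print(all(
    tolerated(M, M1, theta, tau, t, **args) == tolerated(M, M1, theta, tau - t, 0, **args)
    for tau in range(-3, 4) for t in range(-3, 4)
))
```

Comparing whole output sequences for every input is equivalent to comparing last output symbols for every input, which is why the sketch can use sequences even though Definition 10 is stated for the behavior function.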

Example 7 

Let $M_4$ be the sequence generator shown in Figure 10. This
machine could be implemented by the circuit shown in Figure 11.

Figure 10 Machine $M_4$

Figure 11 Circuit for $M_4$

Let $f$ be a fault of $M_4$ which is caused by $d_1$ becoming stuck-at-1 at
time $\tau$. Then $f = (M_4', \tau, \theta)$ where $M_4'$ is the machine represented by
the graph in Figure 12 and $\theta$ is as indicated below.

Figure 12 Machine $M_4'$

Consider $f_{-1}$, i.e., the fault $(M_4', -1, \theta)$, and note that $\beta^{f_{-1}}_0(11) = 1$
whereas $\beta_0(11) = 0$. Thus $f_{-1}$ is not tolerated by $M_4$. On the other
hand both $M_4$ and $M_4^{f_{-1}}$ will produce the sequence 00010101... when
reset at $-10$. Thus $f_{-1}$ is tolerated by $M_4$ for resets at $-10$. By
applying Theorem 5 one can learn that $f_i$ is not tolerated by $M_4$ for
resets at time $i + 1$ and that $f_9$ is tolerated by $M_4$.

Recall that our goal is to develop a theory of on-line diagnosis for
time-invariant systems and that we have introduced time-varying
systems only to be able to represent the dynamics of time-invariant
systems as faults occur. However, it has been the case thus far
that this theory has generalized in a straightforward manner to a
theory of on-line diagnosis for time-varying systems. For example,
we have defined a fault of a system where we could have simply
defined a fault of a machine, and we have defined a notion of fault
tolerance for systems.

From this point on generalizations of this sort will not be valid 
for we will always be considering resets to be released at time 0 and 
for time-varying systems this simplification is not possible. A theory 
of on-line diagnosis of systems could be developed along the line of 
what we will present for machines but we will no longer pursue it. 

Definition 11 

Let $f$ be a fault of a machine $M$ and let $g$ be an arbitrary function
from $Z$ into some set. Then $f$ is $g$-tolerated by $M$ if for each $r$ in
$R$ and $x$ in $I^+$

$$g(\beta^f_r(x)) = g(\beta_r(x)).$$

If $g = P_1$ ($g = P_2$), $g$-tolerated corresponds to behavioral correctness
with respect to the principal (augmented) behavior and we will use the
suggestive term $\gamma$-tolerated ($\alpha$-tolerated). If $\tilde{M}$ realizes $M$ under a triple of
functions, the corresponding notion of $g$-tolerated becomes important, for it
corresponds to correctness with respect to the originally specified behavior,
i.e., the behavior of $M$.

Note that $f$ is tolerated by $M$ implies that $f$ is $g$-tolerated by $M$
for every $g$. Also, $f$ is tolerated if and only if $f$ is $\gamma$-tolerated and
$\alpha$-tolerated.

Due to the definitions of the $\alpha$ and $\gamma$ functions (in terms of
projections composed with the $\beta$ function), definitions and theorems
concerning $\beta$ can generally be transformed into corresponding definitions
and theorems which relate to the $\alpha$ and $\gamma$ functions. This is true
in general for any behavior function of the form $g \circ \beta$. When this
is the case, as in the next definition, only the $\beta$ function will be
mentioned explicitly.

Definition 12 

Let $f$ be a fault of $M$, $r \in R$, and $x \in I^+$. Then $f$ with initial reset
$r$ and input $x$ will cause an error if

$$\hat\beta^f_r(x) \ne \hat\beta_r(x).$$

To avoid this cumbersome phrase, if $\hat\beta^f_r(x) \ne \hat\beta_r(x)$ we will simply say
that $(f, r, x)$ is an error, and when it is clear that we are interested
not only in the erroneous output sequence but also in how it arises we
will say that $\hat\beta^f_r(x)$ is an error.

When we are interested in errors with respect to other behavior
functions we will use the phrases $\gamma$-error, $\alpha$-error, or most generally,
$g$-error.

Example 8 

Recall that in Example 7 $f = (M_4', \tau, \theta)$ was a fault of $M_4$ and that
$\beta^{f_{-1}}_0(11) \ne \beta_0(11)$. Thus $(f_{-1}, 0, 11)$ is an error and 01 is the
erroneous output sequence caused by this error.

Clearly, $\hat\gamma^f_r(x)$ is a $\gamma$-error implies $\hat\beta^f_r(x)$ is an error but not
conversely. Observe that $\hat\beta^f_r(x) \ne \hat\beta_r(x)$ implies $\hat\beta^f_r(xy) \ne \hat\beta_r(xy)$ for
all $y \in I^*$. Thus $\hat\beta^f_r(x)$ is an error implies $\hat\beta^f_r(xy)$ is also.

If $y \in I^*$ and $a \in I$ are such that $\hat\beta^f_r(ya)$ is an error but $\hat\beta^f_r(y)$
is not, then $ya$ is a minimal error input for $M^f$ with initial reset $r$.
In this case $\beta^f_r(x) \ne \beta_r(x)$ where $x = ya$ and we say that $(f, r, x)$
(alternatively, $\hat\beta^f_r(x)$) is a minimal error.
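
The notion of a minimal error input lends itself to a simple search: extend only those inputs that have not yet produced an error, and record an input as soon as its last output symbol disagrees with the fault-free machine. The following Python sketch assumes a hypothetical machine (a mod-2 accumulator), a hypothetical fault (output stuck at 1 from time 1), and a fault time $i \ge 0$ with the reset released at time 0; the function names are illustrative only.

```python
def run(machine, q, x):
    """Output sequence and final state of a time-invariant machine started in q."""
    delta, lam = machine
    out = []
    for a in x:
        out.append(lam(q, a))
        q = delta(q, a)
    return out, q

def faulty_outputs(M, M1, theta, i, q0, x):
    """Output sequence of M^{f_i} for a fault at time i >= 0, reset state q0 at time 0."""
    head, q = run(M, q0, x[:i])
    tail, _ = run(M1, theta(q), x[i:])
    return head + tail

def minimal_errors(M, M1, theta, i, q0, alphabet, max_len):
    """Minimal error inputs of length <= max_len: ya is an error although y is not."""
    found, frontier = [], [()]           # frontier holds the error-free inputs
    for _ in range(max_len):
        survivors = []
        for y in frontier:
            for a in alphabet:
                x = y + (a,)
                if faulty_outputs(M, M1, theta, i, q0, x) != run(M, q0, x)[0]:
                    found.append(x)      # the sequences first differ in the last symbol
                else:
                    survivors.append(x)
        frontier = survivors
    return found

# Hypothetical machine and fault over the input alphabet {0, 1}
M = (lambda q, a: q ^ a, lambda q, a: q ^ a)     # mod-2 accumulator
M1 = (lambda q, a: q ^ a, lambda q, a: 1)        # output stuck at 1
print(minimal_errors(M, M1, lambda q: q, 1, 0, (0, 1), 3))
```

Because only error-free inputs are extended, every input reported is minimal by construction, and no extension of an error is ever reported a second time.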

Note that if $f$ is tolerated then $f$ can cause no errors. Equivalently,
if there exist $r \in R$ and $x \in I^+$ such that $\hat\beta^f_r(x)$ is an error then $f$ is not
tolerated. The converse to this is also true. Namely, if $f$ is not
tolerated then there exist $r \in R$ and $x \in I^+$ such that $\hat\beta^f_r(x)$ is an error.

Our definition of tolerated induces a relation $\tau$ on $\mathcal{R}$ where $M^f \,\tau\, M$
if and only if $f$ is tolerated by $M$. If $f$ is improper then $M^f = M$ and
thus $f$ is tolerated by $M$. Hence $M \,\tau\, M$, and therefore $\tau$ is a tolerance
relation. Likewise $\gamma$-tolerated and $g$-tolerated induce tolerance
relations $\tau_\gamma$ and $\tau_g$. We say that a fault $f$ is $\tau$-diagnosable if $f$ is not tolerated
by $M$ (i.e., $M^f \,\tau\, M$ does not hold). Thus $f$ is $\tau$-diagnosable if and only if $f$ will cause
an error for some initial reset $r$ and input $x$. Finally, we note that
since $f$ is tolerated implies that $f$ is $\gamma$-tolerated, as sets $\tau \subseteq \tau_\gamma$. Thus
it is possible to consider faults which are $\tau_\gamma$-tolerated and $\tau$-diagnosable.

Often we will be in a situation where we are concerned with a
machine $M$ tolerating a set of faults which are all caused by the same
phenomenon but which may occur at any time. More specifically, let
$f$ be a fault of $M$. We would like a result which assured us that if some
finite subset of $[f]$ was tolerated by $M$ then all of $[f]$ was tolerated by
$M$. Later we will be interested in the same problem with regard to
diagnosis. The following notion of equivalent errors will be very
useful to us as we investigate this problem.

Informally, we will say that two errors $(f_i, r_1, x)$ and $(f_j, r_2, y)$
with $i, j \ge 0$ are equivalent if they are caused by equivalent faults,
if the inputs $x$ and $y$ are such that $M^{f_i}$ and $M^{f_j}$ will receive identical
input sequences from time $i$ and time $j$ respectively, and if the initial
resets $r_1$ and $r_2$ and the inputs $x$ and $y$ are such that $M$ with initial
reset $r_1$ and input $x$ would arrive at time $i$ at the same state at which
it would arrive at time $j$ given the initial reset $r_2$ and the input $y$.
In other words, from time $i$ in $M^{f_i}$ and time $j$ in $M^{f_j}$ exactly the
same thing will happen to exactly the same systems modulo a
translation in time. More precisely,
Definition 13

Let $f = (S', \tau, \theta)$ be a fault of $M$ and let $f_i, f_j \in [f]$ with $i, j \ge 0$.
Let $(f_i, r_1, x)$ and $(f_j, r_2, y)$ be two errors. Then $(f_i, r_1, x)$ is equivalent
to $(f_j, r_2, y)$ ($(f_i, r_1, x) \equiv (f_j, r_2, y)$) if

i) $x = x_1 z$ and $y = y_1 z$ where $|x_1| = i$ and $|y_1| = j$,
ii) $\delta(\rho(r_1), x_1) = \delta(\rho(r_2), y_1)$.

It is easy to see that this relation is in fact an equivalence
relation, i.e., it is reflexive, symmetric, and transitive.

The next result shows us one way in which we can manufacture
equivalent errors and it has an immediate corollary in the realm of
fault tolerance. This result is a simple consequence of the fact that
any state which is reachable in an $\ell$-reachable machine is reachable
by time $\ell$.

Theorem 6 

Let $f$ be a fault of an $\ell$-reachable machine $M$ and let $(f_i, r, x)$
be an error where $i \ge 0$. Then there exists an equivalent error
$(f_j, s, y)$ with $0 \le j \le \ell$.

Proof: Let $x = x_1 z$ where $|x_1| = i$ and let $q = \delta(\rho(r), x_1)$. Since $q$ is
in the reachable part of $M$ and $M$ is $\ell$-reachable there exist $s \in R$
and $y_1 \in I^*$ such that $\delta(\rho(s), y_1) = q$ and $|y_1| \le \ell$. Take $j = |y_1|$
and $y = y_1 z$. Clearly, $(f_j, s, y)$ is an error and by its construction it
is equivalent to $(f_i, r, x)$.
Corollary 6.1

Let $f$ be a fault of an $\ell$-reachable machine $M$ and suppose that
$\{f_0, \ldots, f_\ell\}$ is tolerated by $M$. Then $\{f_0, f_1, \ldots\}$ is tolerated by $M$.

Proof: Assume that $f_i$ with $i \ge 0$ is not tolerated by $M$. Then there
exists an error $(f_i, r, x)$. By Theorem 6 there exists an equivalent
error $(f_j, s, y)$ with $0 \le j \le \ell$. Therefore $f_j$ is not tolerated by $M$.
Contradiction. Hence, $f_i$ is tolerated by $M$ for all $i \ge 0$.

Corollary 6.2

Let $f$ be a fault of $M$ with reachable part $P$. Suppose that $\rho(R) = P$
and that $f_0$ is tolerated by $M$. Then $\{f_0, f_1, \ldots\}$ is tolerated by $M$.

Proof: Since $\rho(R) = P$, $M$ is 0-reachable. Apply Corollary 6.1.
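
To use Corollaries 6.1 and 6.2 in practice one needs the reachable part $P$ and a bound $\ell$. A breadth-first search from the reset states yields both. The sketch below is an assumption-laden illustration: the formal definition of $\ell$-reachable is not restated in this section, so the code takes it to mean that every reachable state is reached from some reset state by an input of length at most $\ell$, and it returns the least such bound. The counter machine and the name `reach_bound` are hypothetical.

```python
def reach_bound(delta, rho, resets, alphabet):
    """Breadth-first search from the reset states.
    Returns (l, P): P is the reachable part and l the least bound such that
    every state in P is reached by some input of length <= l."""
    depth = {rho(r): 0 for r in resets}
    frontier = list(depth)
    while frontier:
        nxt = []
        for q in frontier:
            for a in alphabet:
                p = delta(q, a)
                if p not in depth:
                    depth[p] = depth[q] + 1
                    nxt.append(p)
        frontier = nxt
    return max(depth.values()), set(depth)

# Hypothetical machine: a counter that saturates at 2, reset to 0.
delta = lambda q, a: min(q + 1, 2)
l, P = reach_bound(delta, lambda r: 0, resets=[0], alphabet=["a"])
print(l, sorted(P))
print(list(range(l + 1)))     # occurrence times i for which f_i must be checked

# With a reset to every state, rho(R) = P and the bound drops to 0.
l0, _ = reach_bound(delta, lambda r: r, resets=[0, 1, 2], alphabet=["a"])
print(l0)
```

Under this reading, Corollary 6.1 reduces the infinite family $f_0, f_1, \ldots$ to the finitely many occurrence times printed by the second line.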

Now we will focus our attention on faults which occur before time
0. In the previous results we have excluded this case because if $f_i$
and $f_j$ are equivalent faults with $i$ or $j$ less than 0 there is, in general,
no relation with respect to resets at time 0 between the behaviors of
$M^{f_i}$ and $M^{f_j}$. However, in the important special case where $f = (M', \tau, \theta)$
any $f_i \in [f]$ with $i < 0$ will, with respect to resets released at time 0,
cause identical behavior. This is because $f_i = (M', i, \theta)$ and by Corollary
2.1, $\beta^{f_i}_r(x) = \beta'_r(x)$ for all $i < 0$.

Theorem 7

Let $f = (M', \tau, \theta)$ be a fault of $M$. Then $\beta^{f_i}_r(x) = \beta'_r(x)$ for all
$r \in R$, $x \in I^+$, and $i < 0$. In addition, if $f_j$ is tolerated by $M$ for some
$j < 0$ then $f_i$ is tolerated by $M$ for all $i < 0$.

Proof: We have already shown the first statement. Thus
$\beta^{f_i}_r(x) = \beta^{f_j}_r(x)$ for all $i, j < 0$ and clearly one is tolerated if and
only if the other is tolerated.

If $f = (M', \tau, \theta)$ is a fault of $M$ we think of $f$ as affecting the reset
mechanism of $M$ if $\rho'(r) \ne \theta(\rho(r))$ for some $r \in R$. If this is not the
case then a further result, similar to Theorem 7, can be obtained.

Theorem 8

Let $f = (M', \tau, \theta)$ be a fault of $M$ and suppose that $\rho'(r) = \theta(\rho(r))$
for all $r \in R$. Then $\beta^{f_0}_r(x) = \beta'_r(x)$ for all $r \in R$ and $x \in I^+$. In addition,
if $f_j$ is tolerated by $M$ for some $j \le 0$ then $f_i$ is tolerated by $M$ for
all $i \le 0$.

Proof: Since $\rho'(r) = \theta(\rho(r))$, it is immediate from Corollary 2.1
that $\beta^{f_0}_r(x) = \beta'_r(x)$. Therefore $\beta^{f_i}_r(x) = \beta^{f_j}_r(x)$ for all $i, j \le 0$ and
the result follows from this.

Combining Theorem 7 with Corollary 6.1 we have

Theorem 9

Let $f = (M', \tau, \theta)$ be a fault of an $\ell$-reachable machine $M$ and
suppose that $\{f_{-1}, f_0, \ldots, f_\ell\}$ is tolerated by $M$. Then $[f]$ is tolerated
by $M$.

We finish this section by restating Corollary 6.2 and Theorem
8 as a result which in some sense is the best possible.
Theorem 10

Let $M$ be a machine with reachable part $P$ and let $f = (M', \tau, \theta)$ be a
fault of $M$. Suppose $\rho'(r) = \theta(\rho(r))$ for each $r$ in $R$, $\rho(R) = P$, and $f_j$ is
tolerated by $M$ for some $j \le 0$. Then $[f]$ is tolerated by $M$.

Proof: By Theorem 8 $f_i$ is tolerated by $M$ for all $i \le 0$. Therefore
$f_0$ is tolerated by $M$, and thus by Corollary 6.2 $f_i$ is tolerated by $M$
for all $i \ge 0$. Thus, $[f]$ is tolerated by $M$.

6. On-line Diagnosis 

Before we can present our concept of on-line diagnosis in the 
framework that we have built we need one final definition. 

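Definition 14

Let $S_1$ and $S_2$ be two systems. If $R_1 = R_2$ and $Z_1 \subseteq I_2$ then
the series connection of $S_1$ and $S_2$ is the system $S_1 * S_2$
with input alphabet $I_1$, output alphabet $Z_2$, reset alphabet $R_1$, and
state set $Q = Q_1 \times Q_2$, where

$$\delta((q_1,q_2),a,t) = (\delta_1(q_1,a,t), \delta_2(q_2, \lambda_1(q_1,a,t), t))$$
$$\lambda((q_1,q_2),a,t) = \lambda_2(q_2, \lambda_1(q_1,a,t), t)$$
$$\rho(r,t) = (\rho_1(r,t), \rho_2(r,t)).$$

Schematically, $S_1 * S_2$ can be pictured as in Figure 13.

Figure 13 The Series Connection of $S_1$ and $S_2$
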
Given a series connection $S_1 * S_2$ as above we will let $\beta^1$, $\beta^2$,
and $\beta^*$ denote the behavior functions of $S_1$, $S_2$ and $S_1 * S_2$ respectively.
We now state the intuitive result that the behavior of $S_1 * S_2$ is equal
to the extended behavior of $S_1$ composed with the behavior of $S_2$.

Theorem 11

Let $S_1 * S_2$ be the series connection of $S_1$ with $S_2$. Let $r \in R$,
$x \in I^+$, $q_1 \in Q_1$, $q_2 \in Q_2$, and $t \in T$. Then

$$\beta^*_{(q_1,q_2)}(x,t) = \beta^2_{q_2}(\hat\beta^1_{q_1}(x,t), t)$$

and

$$\beta^*_{r,t}(x) = \beta^2_{r,t}(\hat\beta^1_{r,t}(x)).$$


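Proof: We will first derive $\delta$.

Claim: $\delta((q_1,q_2),x,t) = (\delta_1(q_1,x,t), \delta_2(q_2, \hat\beta^1_{q_1}(x,t), t))$.

Proof by induction: Let $|x| = 1$. Then

$$\delta((q_1,q_2),x,t) = (\delta_1(q_1,x,t), \delta_2(q_2, \lambda_1(q_1,x,t), t)) = (\delta_1(q_1,x,t), \delta_2(q_2, \hat\beta^1_{q_1}(x,t), t)).$$

Assume the result is true for all $x$ in $I^+$ of length $n-1$. Let $|x| = n$
and $x = ya$ where $|y| = n-1$. Then

$$\delta((q_1,q_2),ya,t) = \delta(\delta((q_1,q_2),y,t), a, t+n-1)$$
$$= \delta((\delta_1(q_1,y,t), \delta_2(q_2, \hat\beta^1_{q_1}(y,t), t)), a, t+n-1)$$
$$= (\delta_1(\delta_1(q_1,y,t), a, t+n-1), \delta_2(\delta_2(q_2, \hat\beta^1_{q_1}(y,t), t), \lambda_1(\delta_1(q_1,y,t), a, t+n-1), t+n-1))$$
$$= (\delta_1(q_1,ya,t), \delta_2(\delta_2(q_2, \hat\beta^1_{q_1}(y,t), t), \beta^1_{q_1}(ya,t), t+n-1))$$
$$= (\delta_1(q_1,ya,t), \delta_2(q_2, \hat\beta^1_{q_1}(ya,t), t)).$$

Having proved our claim the rest follows directly from the definitions.
Again let $|x| = n$, $x = ya$, and $|y| = n-1$.

$$\beta^*_{(q_1,q_2)}(x,t) = \lambda(\delta((q_1,q_2),y,t), a, t+n-1)$$
$$= \lambda((\delta_1(q_1,y,t), \delta_2(q_2, \hat\beta^1_{q_1}(y,t), t)), a, t+n-1)$$
$$= \lambda_2(\delta_2(q_2, \hat\beta^1_{q_1}(y,t), t), \lambda_1(\delta_1(q_1,y,t), a, t+n-1), t+n-1)$$
$$= \lambda_2(\delta_2(q_2, \hat\beta^1_{q_1}(y,t), t), \beta^1_{q_1}(ya,t), t+n-1)$$
$$= \beta^2_{q_2}(\hat\beta^1_{q_1}(y,t)\,\beta^1_{q_1}(ya,t), t)$$
$$= \beta^2_{q_2}(\hat\beta^1_{q_1}(x,t), t).$$

This establishes the first equation. The second equation is an
immediate consequence of this one.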
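
The series connection and the statement of Theorem 11 translate directly into code when both components are time-invariant. The following Python sketch makes that simplifying assumption (the time argument is dropped), and the two component machines, a mod-2 accumulator and a unit delay, are hypothetical. It builds the series connection from the definition and checks that its output sequence equals the second machine's output sequence on the first machine's output sequence.

```python
from itertools import product

def ext_behavior(machine, q, x):
    """Extended behavior: the whole output sequence produced from state q."""
    delta, lam = machine
    out = []
    for a in x:
        out.append(lam(q, a))
        q = delta(q, a)
    return tuple(out)

def series(m1, m2):
    """Series connection of Definition 14 for time-invariant machines."""
    (d1, l1), (d2, l2) = m1, m2
    delta = lambda q, a: (d1(q[0], a), d2(q[1], l1(q[0], a)))
    lam = lambda q, a: l2(q[1], l1(q[0], a))
    return delta, lam

# Hypothetical components: a mod-2 accumulator feeding a unit delay.
acc = (lambda q, a: q ^ a, lambda q, a: q ^ a)
delay = (lambda q, b: b, lambda q, b: q)

ok = all(
    ext_behavior(series(acc, delay), (q1, q2), x)
    == ext_behavior(delay, q2, ext_behavior(acc, q1, x))
    for q1 in (0, 1) for q2 in (0, 1)
    for n in range(1, 6) for x in product((0, 1), repeat=n)
)
print(ok)
```

Equality of the full output sequences for every input implies the equality of last output symbols asserted by the theorem.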
We are now ready to define our notion of on-line diagnosis.
This concept involves an external detector $D$ (assumed to be fault-free)
and a time-delay $k$ within which any error produced by a fault
must be detected. More precisely, let $(M, F)$ be a machine with faults,
$D$ a machine with $R_D = R$ and $I_D = Z$, and $k$ a nonnegative integer.
Then

Definition 15

$(M, F)$ is $(D, k)$-diagnosable if

i) the behavior of $M * D$ is the constant 0 function for
each initial reset $r \in R$;

ii) for each $f \in F$ the system $M^f * D$ is such that if $(f, r, x)$
is a minimal $\gamma$-error then $\hat\beta^*_r(xy) \ne 0^{|xy|}$ for all $y \in I^*$
with $|y| = k$.

More generally, if $\mathcal{D}$ is a set of machines then $(M, F)$ is $(\mathcal{D}, k)$-diagnosable
if there exists a $D$ in $\mathcal{D}$ such that $(M, F)$ is $(D, k)$-diagnosable.

Note that i) implies $0 \in Z_D$, the output alphabet for $D$. Each
$z \in Z_D$ other than 0 is called a fault-detection signal. The choice
of the symbol "0" to indicate that the machine $M$ is operating
properly is purely for notational convenience. In general we could
let any subset of $Z_D$ indicate proper operation and let the complement
of this set in $Z_D$ be the set of fault-detection signals. In a practical
application this choice would depend on the design constraints on the
detector.

The two conditions in this definition can be paraphrased as:

i) The detector should never emit a fault-detection signal
if the machine that it is monitoring is fault-free.

ii) The detector must emit a fault-detection signal within
k time steps of the occurrence of the first $\gamma$-error
produced by the faulty machine, regardless of the
input after the error.

Thus, $D$ observes the output of $M^f$ and must make a decision
based on this observation as to whether $M^f$ has produced a $\gamma$-error.
This decision may take some time, thus the parameter $k$. The
complexity of $D$ is a measure of the difficulty of this decision.
Note that the detector takes no part in the computation of the output
of $M$.

The on-line diagnosis problem can now be stated as:

Given a machine $M$, a class of faults $F$, a class of machines $\mathcal{D}$,
and a delay $k$, find an (economical) d-realization $\tilde{M}$ of
$M$ such that $(\tilde{M}, F)$ is $(\mathcal{D}, k)$-diagnosable.

Two major types of questions that will be of interest to us are
questions of existence and economy. Questions of existence will be
of the form: Given $M$, $\mathcal{D}$, and $k$, does there exist an $\tilde{M}$ such that $\tilde{M}$
d-realizes $M$ and $(\tilde{M}, F)$ is $(\mathcal{D}, k)$-diagnosable? Questions of economy will
be of the form: Given that $(\tilde{M}, F)$ is $(\mathcal{D}, k)$-diagnosable, can we discover an
$\tilde{M}'$ such that $\tilde{M}'$ d-realizes $M$ and $(\tilde{M}', F)$ is $(\mathcal{D}', k')$-diagnosable where
$\mathcal{D}'$ is more restricted than $\mathcal{D}$ and/or $k' < k$? In answering questions such as
these we seek methods for designing machines with these properties.

Other fundamental questions are: What time-space tradeoffs
are possible between the complexity of $D$ and the magnitude of the
time-delay $k$? The detector, since it is fault-free, can be
considered as the "hard-core" in this model. Thus, in our last
question we are inquiring as to the effect of the complexity of the
"hard-core" on the complexity of the total machine-detector
configuration.

In the next section we will present some results which will begin 
to answer these questions. We finish this section with two definitions 
which will distinguish two special types of diagnosis. 

Definition 16

$(M, F)$ is $(\mathcal{D}, k)$-detectable if it is $(\mathcal{D}, k)$-diagnosable and each
$f \in F$ either is not tolerated or is improper.

Definition 17

$(M, F)$ is (k)-self-diagnosable if $(M, F)$ is $(D, k)$-diagnosable where
$D$ is the trivial machine which implements the projection $P_2$, i.e.,
the augmented output of $M$ serves as the output of the detector.

Example 9

Suppose that a d-realization of the machine of Example 1 is desired
which is (0)-self-diagnosable for the class of faults $F$ which
is caused by any failure which affects one delay element. The machine
represented by Figure 14 and implemented as shown in Figure 15 is
such a realization.

Figure 15 Circuit for the realizing machine

This realization is (0)-self-diagnosable for $F$ because the added delay acts as a
parity bit and thus any erroneous value on the output of any of the
delays can be detected by the "ODD" gate which produces a 1 output
if and only if the parity of its inputs is odd.
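
The parity argument can be tried out on a small stand-in circuit. Since Figures 14 and 15 are not reproduced here, everything in the following Python sketch is hypothetical: a two-bit counter with an enable input plays the role of the machine, a third delay stores the parity of the next-state bits, the principal output is read from the two state delays, and the augmented output is the "ODD" check over all three delay outputs. A fault forces the output of one delay to a fixed value from time `tau` onward.

```python
from itertools import product

def next_bits(d1, d2, a):
    """Hypothetical next-state logic: a 2-bit counter with enable input a."""
    n = (2 * d1 + d2 + a) % 4
    return n >> 1, n & 1

def step(state, a, stuck=None):
    """One clock step; stuck = (index, value) forces the output of one delay."""
    s = list(state)
    if stuck is not None:
        s[stuck[0]] = stuck[1]
    d1, d2, p = s
    z = d1 & d2                   # principal output, read from the delays
    alarm = d1 ^ d2 ^ p           # "ODD" gate: 1 iff the parity check fails
    n1, n2 = next_bits(d1, d2, a)
    return (n1, n2, n1 ^ n2), (z, alarm)

def run(x, stuck=None, tau=0):
    """Outputs for input x from reset state (0, 0, 0); the fault acts from time tau."""
    state, out = (0, 0, 0), []
    for k, a in enumerate(x):
        state, z = step(state, a, stuck if k >= tau else None)
        out.append(z)
    return out

def self_diagnosable(max_len=6):
    """Bounded check: no false alarm, and an alarm no later than the first
    principal-output error, for every single stuck delay and occurrence time."""
    for n in range(1, max_len + 1):
        for x in product((0, 1), repeat=n):
            good = run(x)
            if any(alarm for _, alarm in good):
                return False
            for idx, v, tau in product(range(3), (0, 1), range(n)):
                bad = run(x, (idx, v), tau)
                for k in range(n):
                    if bad[k][0] != good[k][0]:          # first principal error
                        if not any(al for _, al in bad[:k + 1]):
                            return False
                        break
    return True

print(self_diagnosable())
```

The check succeeds on this stand-in because the first time a stuck delay shows a wrong value, exactly one of the three delay outputs differs from its fault-free value, which makes the parity odd at that same time step.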
7. Preliminary Results

Here we present a potpourri of results which will begin to 
answer some of the general questions we have posed and which will 
help us to understand the nature of diagnosis as we have defined it. 
The first result of this section shows us that if we allow the detector 
to become as complex as the system it is observing then we have, 
in effect, created an oracle which can diagnose nearly every fault. 


Theorem 12

Let $M$ be any machine and let $\tilde{M}$ d-realize $M$, where $\tilde{M}$ is the machine
formed from $M$ by augmenting the output with a copy of the input.
Let $F$ be any set of faults which are $\alpha$-tolerated by $\tilde{M}$, and let $\mathcal{D}$
be the unrestricted class of all machines. Then $(\tilde{M}, F)$ is $(\mathcal{D}, 0)$-diagnosable.

Proof: Let $D$ be a copy of $M$ along with an equivalence gate which
produces a 0 if and only if the principal output of $\tilde{M}$ is identical with
the result as computed by the copy of $M$ in $D$. Clearly $(\tilde{M}, F)$ is $(D, 0)$-diagnosable.
Pictorially, we can view $\tilde{M} * D$ as in Figure 16.

Figure 16 Diagnosis Via Duplication in the Detector
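
The duplication argument can be exercised with a bounded version of the two conditions of Definition 15. The Python sketch below is an illustration under assumptions, not the construction of the text: the machine (a mod-2 accumulator whose output is paired with a copy of the input), the fault (principal output stuck at 1 from time 1, input copy unaffected), and the helper names are hypothetical; a single reset state is used for the machine and for the detector; and only inputs up to `max_len` are examined.

```python
from itertools import product

def run(machine, q, x):
    """Output sequence and final state of a time-invariant machine started in q."""
    delta, lam = machine
    out = []
    for a in x:
        out.append(lam(q, a))
        q = delta(q, a)
    return out, q

def faulty_run(M, fault, q0, x):
    """Output sequence of M^f for f = (M1, theta, i) with i >= 0 and reset at time 0."""
    M1, theta, i = fault
    head, q = run(M, q0, x[:i])
    return head + run(M1, theta(q), x[i:])[0]

def diagnosable(M, faults, D, d0, q0, alphabet, k, max_len):
    """Bounded check of Definition 15; outputs of M are pairs (principal, augmented)."""
    for n in range(1, max_len + 1):
        for x in product(alphabet, repeat=n):
            good = run(M, q0, x)[0]
            if any(run(D, d0, good)[0]):                   # i) no false alarm
                return False
            for f in faults:
                bad = faulty_run(M, f, q0, x)
                gp, bp = [z[0] for z in good], [z[0] for z in bad]
                minimal = gp[:-1] == bp[:-1] and gp[-1] != bp[-1]
                if minimal and n + k <= max_len:
                    for y in product(alphabet, repeat=k):  # ii) alarm within k steps
                        if not any(run(D, d0, faulty_run(M, f, q0, x + y))[0]):
                            return False
    return True

# Hypothetical machine: mod-2 accumulator, output augmented with a copy of the input.
Mt = (lambda q, a: q ^ a, lambda q, a: (q ^ a, a))
# Hypothetical fault at time 1: principal output stuck at 1, input copy unaffected.
f = ((lambda q, a: q ^ a, lambda q, a: (1, a)), lambda q: q, 1)
# Detector: a copy of the accumulator plus a comparison with the observed output.
D_dup = (lambda q, za: q ^ za[1], lambda q, za: int(za[0] != (q ^ za[1])))
# Trivial detector with constant 0 output, for contrast.
D_zero = (lambda q, za: q, lambda q, za: 0)

print(diagnosable(Mt, [f], D_dup, 0, 0, (0, 1), k=0, max_len=5))
print(diagnosable(Mt, [f], D_zero, 0, 0, (0, 1), k=0, max_len=5))
```

The duplicating detector passes with delay 0 because it recomputes the correct output from the input copy, which the fault leaves intact; the constant-0 detector fails as soon as the fault produces its first principal-output error.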
This result clearly indicates that any further result of interest
must involve a limitation on the complexity of $D$ and/or the amount or
type of output augmentation allowed. This motivates the following
extension of our definition of diagnosis. Let $M$ be an output-augmented
machine with $Z \subseteq Z_P \times Z_A$, and let $n$ be a positive integer. Then

Definition 18

$(M, F)$ is $(\mathcal{D}, k, n)$-diagnosable if $(M, F)$ is $(\mathcal{D}, k)$-diagnosable and
$|Z_A| \le n$.

A result similar to Theorem 12 can be obtained by drawing the
dashed lines in Figure 16 in a different manner as shown in Figure 17.

Figure 17 Diagnosis Via Duplication in the Realization

This situation is more realistic than the previous one, for now faults
which may affect either or both copies must be taken into account.
However, this is still a powerful diagnosis technique since clearly
any fault which affects only one copy of $M$ and many which affect both
copies will be diagnosable.

The next result will help us to see the relationship between
fault diagnosis and fault tolerance.

Theorem 13

Let $M$ be a machine and $F$ a class of faults such that $F$ is
$\gamma$-tolerated by $M$. Then $(M, F)$ is $(D_0, 0)$-diagnosable where $D_0$ is the
trivial machine which produces a constant 0 output.

Proof: Condition i) is clearly satisfied and condition ii) is trivially
satisfied since if $M$ $\gamma$-tolerates $F$ then $f$ can cause no $\gamma$-errors for
any $f$ in $F$.

The decision in this case can be trivially made since no $\gamma$-errors
are ever produced. The situation for tolerated faults is not so simple
as this result may seem to indicate, for it must be remembered that
$\gamma$-tolerated does not imply tolerated and thus a $\gamma$-tolerated fault could
be detected through an error which only showed up in the augmented
output.

We will now develop some results concerning diagnosis which
are analogous to Corollaries 6.1 and 6.2 and to Theorems 7 through
10. Let $D$ be a detector for a machine $M$. It will often be the case
that the second coordinate of the state of $M * D$ can be uniquely
determined from the first coordinate. In particular, this is always the
case when $|Q_D| = 1$. More formally, the series connection of $M_1$
with $M_2$ is synchronized if there exists a function $h: Q_1 \to Q_2$ such
that for each $(q_1, q_2)$ in the reachable part of $M_1 * M_2$, $h(q_1) = q_2$.
Such a function is called the synchronizing function of $M_1 * M_2$ and
it must satisfy $h(\rho_1(r)) = \rho_2(r)$ for each $r$ in $R$. We can now state
the counterpart of Corollary 6.1.

Theorem 14

Let $M$ be an $\ell$-reachable machine and let $D$ be a detector for
$M$ such that $M * D$ is synchronized. Suppose that $(M, \{f_0, \ldots, f_\ell\})$
is $(D, k)$-diagnosable. Then $(M, \{f_0, f_1, \ldots\})$ is $(D, k)$-diagnosable.


Proof: Condition i) of Definition 15 is immediately satisfied. Let
$f_i \in [f]$ with $i \ge 0$, and let $\hat\gamma^{f_i}_r(x)$ be a minimal $\gamma$-error. Since Theorem
6 applies to $\gamma$-errors as well as to errors ($\beta$-errors) there exists an
equivalent $\gamma$-error $\hat\gamma^{f_j}_s(y)$ with $0 \le j \le \ell$. Since $\hat\gamma^{f_i}_r(x)$ is minimal it
follows that $\hat\gamma^{f_j}_s(y)$ is also minimal.

Since $f_j$ is diagnosed by $D$ we know that $M^{f_j} * D$ will produce a
nonzero output sequence for every input sequence $yu$ with $|u| = k$ if
started with initial reset $s$. We need only show that $M^{f_i} * D$ with
initial reset $r$ and any input sequence $xu$ will do the same.

Let $\mu$, $\mu^i$, and $\mu^j$ represent the behavior functions of $M * D$,
$M^{f_i} * D$, and $M^{f_j} * D$ respectively. Let $x = x_1 z$ and $y = y_1 z$ where
$|x_1| = i$ and $|y_1| = j$. Since the $\gamma$-errors are equivalent, $\delta(\rho(r), x_1) = \delta(\rho(s), y_1)$.
Say $\delta(\rho(r), x_1) = q$. Thus both $M^{f_i}$ and $M^{f_j}$ will be in
state $\theta(q)$ at times $i$ and $j$ respectively. Let $h: Q \to Q_D$ be the
synchronizing function of $M * D$. Then both $M^{f_i} * D$ and $M^{f_j} * D$ will
be in state $(\theta(q), h(q))$ at times $i$ and $j$ respectively. Now since $D$ is
time-invariant and since $\beta^{f_i}_{\theta(q)}(w, i) = \beta^{f_j}_{\theta(q)}(w, j)$ for all $w$ in $I^+$ it
follows from Theorem 11 that

$$\mu^i_{(\theta(q), h(q))}(w, i) = \mu^j_{(\theta(q), h(q))}(w, j).$$

We know $\hat\mu^j_s(yu) \ne 0^{|yu|}$ and clearly the nonzero symbol cannot
be produced prior to time $j$. Therefore $\hat\mu^j_{(\theta(q), h(q))}(zu, j) \ne 0^{|zu|}$
for all $u \in I^*$ with $|u| = k$. This implies $\hat\mu^i_{(\theta(q), h(q))}(zu, i) \ne 0^{|zu|}$
and hence $\hat\mu^i_r(xu) \ne 0^{|xu|}$. Therefore $(M, \{f_0, f_1, \ldots\})$ is $(D, k)$-diagnosable.

Corollary 14.1

Let $M$ be a machine with reachable part $P$ and let $D$ be a detector
for $M$ such that $M * D$ is synchronized. Suppose that $\rho(R) = P$
and that $(M, f_0)$ is $(D, k)$-diagnosable. Then $(M, \{f_0, f_1, \ldots\})$ is $(D, k)$-diagnosable.

Proof: $\rho(R) = P$ implies $M$ is 0-reachable. Apply Theorem 14.

Our next two results are analogous to Theorems 7 and 8. 

Theorem 15 

Let f = (M', τ, θ) be a fault of M and suppose that (M, f_j) is (D, k)-
diagnosable for some j < 0. Then (M, {..., f_{-2}, f_{-1}}) is
(D, k)-diagnosable.
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

Proof: By Theorem 7, β^{f_i}(x) = β^{f_j}(x) for all i, j < 0. The result is
an immediate consequence of this fact.

Theorem 16 

Let f = (M', τ, θ) be a fault of M such that ρ'(r) = θ(ρ(r)) for all
r ∈ R. Suppose that (M, f_j) is (D, k)-diagnosable for some j < 0. Then
(M, {..., f_{-1}, f_0}) is (D, k)-diagnosable.

Proof: By Theorems 7 and 8, β^{f_i}(x) = β^{f_j}(x) for all i, j ≤ 0.

Combining Theorems 14 and 15 yields 
Theorem 17 

Let M be an ℓ-reachable machine and let D be a detector for M
such that M * D is synchronized. Let f = (M', τ, θ) be a fault of M and
suppose that (M, {f_{-1}, f_0, ..., f_ℓ}) is (D, k)-diagnosable. Then
(M, [f]) is (D, k)-diagnosable.

We terminate this line of development by stating the combination 
of Corollary 14.1 with Theorem 16.

Theorem 18 

Let M be a machine with reachable part P and suppose that
ρ(R) = P. Let D be a detector for M such that M * D is synchronized.
Let f = (M', τ, θ) be a fault of M such that ρ'(r) = θ(ρ(r)) for all
r ∈ R. If (M, f_j) is (D, k)-diagnosable for some j < 0 then (M, [f]) is
(D, k)-diagnosable.
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The following result shows that under some conditions if the 
output is not allowed to be augmented then there is a restriction on 
the detector which indicates that diagnosis will generally be difficult. 

Theorem 19 

Let M be a machine and f a fault of M which is not tolerated.
Suppose that (M, f) is (D, k, 1)-diagnosable, and that λ(P, I) = Z where
P is the reachable part of M. Then |Q_D| > 1.

Proof: If |Q_D| = 1 then the output of D at any time depends only on
its input at that time. Since M can produce any symbol in Z the output
of D must be 0 for each input or we would contradict the requirement
that the behavior of M * D is the zero function. But f is not tolerated
and (M, f) is (D, k)-diagnosable. Therefore D must be able to produce a
nonzero output. Contradiction.
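
The counting argument in this proof can be checked by brute force on a
small alphabet. The Python sketch below treats a one-state detector as
a plain map from Z to detector outputs, keeps only those maps that are
zero on every symbol the fault-free machine can produce, and confirms
that the only survivor is the all-zero map, which cannot signal any
fault. The alphabet Z, the binary detector output set, and all names
are assumptions chosen for illustration.

```python
from itertools import product

Z = ["z0", "z1", "z2"]        # hypothetical output alphabet of M
DETECTOR_OUTPUTS = [0, 1]     # 0 means "no error signalled"

def memoryless_detectors(alphabet, outputs):
    """Yield every one-state detector as a dict: symbol -> detector output."""
    for values in product(outputs, repeat=len(alphabet)):
        yield dict(zip(alphabet, values))

def zero_on(detector, symbols):
    """True if the detector outputs 0 on every symbol in `symbols`."""
    return all(detector[z] == 0 for z in symbols)

# lambda(P, I) = Z: the fault-free machine can emit every symbol of Z.
fault_free_outputs = set(Z)

# Detectors compatible with "the behavior of M * D is the zero function".
admissible = [d for d in memoryless_detectors(Z, DETECTOR_OUTPUTS)
              if zero_on(d, fault_free_outputs)]

# Admissible detectors that could still emit a nonzero symbol: none remain.
can_signal = [d for d in admissible if any(v != 0 for v in d.values())]
```

If fault_free_outputs were a proper subset of Z, detectors that are
nonzero only on the unused symbols would remain admissible, which is
why the hypothesis λ(P, I) = Z matters.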

The reason for stating this next result is simply to make note of a
limitation of self-diagnosis, namely that there are some faults (those
which cause γ-errors but which also cause the fault detection signal to
be stuck-at-0) that can never be self-diagnosed.

Theorem 20

Let (M, F) be (k)-self-diagnosable. Then F contains no fault f
which is not γ-tolerated and for which the fault detection signal
produced by M^f is identically 0 for every r ∈ R.


Proof: Obvious. 
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Note that any fault which only affects the reset mechanism is 
tolerated, and thus is diagnosable, if it occurs at or after time 0.
On the other hand if such a fault occurs before time 0 it may be
relatively difficult to diagnose. More precisely,

Theorem 21 

Let f = (M', τ, θ) be a fault of a machine M where τ < 0 and
M' = (I, Q, Z, δ, λ, R, ρ'). Suppose that (M, f) is (D, k)-diagnosable and
that there is an r ∈ R and x ∈ I+ such that γ_r^f(x) is a γ-error with
ρ'(r) ∈ P, the reachable part of M. Then |Q_D| > 1.

Proof: Assume |Q_D| = 1. Then the behavior of M^f * D will be
λ_D(β^f(x)) where λ_D is the function realized by D. Thus
λ_D(β_r^f(z)) ≠ 0 for some z ∈ I+. But β_r^f(z) = β_p(z) where
p = ρ'(r) ∈ P. Now p ∈ P implies that there exist m ∈ R and u ∈ I* such
that p = δ(ρ(m), u). Thus β_m(uz) = β_m(u) β_p(z). Now
λ_D(β_p(z)) ≠ 0 implies that λ_D(β_m(uz)) ≠ 0.
But this contradicts the hypothesis that (M, f) is (D, k)-diagnosable.
Hence |Q_D| > 1.
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8. Possibilities for Further Investigation 

In this report we have taken a fresh look at on-line diagnosis 
from a theoretical point of view. Our first observation was that 
conventional models were not suitable for studying this problem 
and consequently we introduced the notion of a resettable time-varying 
system. With this as our basic model the notions of a fault as a 
transformation of a system S into another system S’ at a time t, and of the 
result of the fault as a system which looks like S up to time t and like S’ 
thereafter came very naturally. The companion notions of fault
tolerance and errors were then introduced and in Section 6 we completed
our formal model with the definition of (D, k)-diagnosable. In that
section we also made the first formal statement of the on-line diagnosis
problem and we outlined some of the questions that will need to 
be answered to adequately solve this problem. 

In Section 7 we made a start at answering some of these questions 
and at understanding the nature of on-line diagnosis. However, we 
have just begun to scratch the surface of the problem and much 
more work remains to be done. Further work could be carried out 
along the lines presented below. 

Except for some of the examples and for the rudimentary 
structure introduced by output augmentation we have been dealing 
with abstract (i.e., totally unstructured) systems. Such an approach
is good for developing formally the concepts involved in our theory 
but some of the questions raised can best be studied in a more 
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structured environment. One reason for this is that with a 
structured system we can consider the causes of faults. For 
example, given an abstract system it makes no sense to speak 
of the set of faults caused by component failures of a certain type 
or by bridging failures. However, given a structured representation
of a system (e.g., a circuit diagram) we can discuss these
and other types of failures (causes) and determine the resulting 
faults (effects). 

There are many different structural levels that could prove 
useful to a further investigation into the theory of on-line diagnosis. 
Three levels which we believe will be important are: the binary
state-assigned level, the logical circuit level, and the
subsystem-network level. These levels and the basis for their potential
usefulness are explained in the following paragraphs.

A machine M is said to be binary state-assigned if Q = {0, 1}^n for
some positive integer n. Given such a machine we can speak of
stuck-at-0 and stuck-at-1 and any other type of memory failure.
The faults corresponding to these failures can be enumerated and
comparisons can be made between various schemes for diagnosing
these faults. Memory faults have been studied before in other
contexts (see [21] and [22] for example) and they are an important
class of faults for a number of reasons. As we have seen, only a
limited amount of structure is needed to discuss them. Thus
memory faults can be analyzed before the circuit design of the machine
is complete. Also, it is memory which distinguishes truly sequential
systems from purely combinational (one-state) systems. Combinational
systems are inherently easier than sequential systems to analyze
and a number of techniques for the on-line diagnosis of such systems
are known (see [8] and [9] for example).
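
A minimal sketch of how the single stuck-at memory failures of a
binary state-assigned machine might be enumerated is given below in
Python. Each failure is represented as a map on Q = {0, 1}^n that
forces one state variable to a constant; how such a map would be
turned into a fault (M', τ, θ) of the model used in this report is not
shown. The function names and the tuple encoding of states are
assumptions of this sketch.

```python
from itertools import product

def all_states(n):
    """All states of a binary state-assigned machine, Q = {0, 1}^n."""
    return list(product((0, 1), repeat=n))

def stuck_at(bit, value):
    """Return a map on states that forces state variable `bit` to `value`."""
    def apply(state):
        return state[:bit] + (value,) + state[bit + 1:]
    return apply

def single_stuck_at_failures(n):
    """Enumerate the 2n single stuck-at failures as (bit, value, map)."""
    return [(bit, value, stuck_at(bit, value))
            for bit in range(n) for value in (0, 1)]

n = 3
failures = single_stuck_at_failures(n)

# Second entry of the enumeration: state variable 0 stuck-at-1.
bit, value, force = failures[1]

# States that remain possible once the failure is present.
image = sorted({force(q) for q in all_states(n)})
```

Because only the state assignment is needed, an enumeration of this
kind can be written down before any circuit design exists, which is
the point made above about memory faults.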

A system possesses structure at the logical circuit level if a 
representation of the system is given in terms of a logical circuit 
composed of primitive logical elements. These may be of the 
AND-OR variety, threshold elements, or any similar elements of 
a "building block" nature depending upon the technology being considered.
This level is useful for investigating failures in the primitive 
components. The circuit in Figure 2 is an example of a structural 
representation at this level and the failure of this circuit discussed 
in Example 2 is a simple example of the analysis that can be conducted
at this level. 

The subsystem-network level is the most general of these three
levels. In general, any system which is represented in terms of a
network of subsystems is said to have the subsystem-network level of
structure. At this level we could study the problem of implementing
on-line diagnosis on a whole computer whereas with the other levels the
emphasis would be on diagnosing one module. Note that in our
definition of diagnosis the detector is not constrained to give simply
a yes-no response. It could also provide extra information for use
in automatic fault location. Thus at this level we could study the
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problem of which subsystems must be explicitly observed by the
detector to achieve some desired fault location property. 

One problem that cannot be naturally studied with our model at 
any structural level is the problem of automatic reconfiguration of 
the system under the control of the detector. To study this 
problem our model would have to allow for feedback from the 
detector to the system it is observing and at the present time 
this is not allowed. 
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